The present invention relates to methods of operating reconfigurable
arrays of data processing elements.

When using such arrays, it is desired to optimise the way the
array is coupled to other units, e.g. to a processor if used as
a coprocessor, and/or to optimise the way in which the array is
configured.

The present invention aims at providing improvements over the
prior art.

It is to be noted that the disclosure of the present invention
does comprise several major parts in its description that all
refer to ways of allowing for an optimum use of the array and
hence are closely related to each other.

It is also to be noted that the parts do comprise a plurality
of figures that the text relates to, however without always
giving an exact, precise and correct reference. Yet any deviations
from correct referencing will be obvious to the average
skilled person.
1 Executive Summary



The study is concerned with three objectives:

1. Proposal of a hardware framework, which enables an efficient integration of the PACT XPP
core into a standard RISC processor architecture.

2. Proposal of a compiler for the coupled RISC+XPP hardware. This compiler decides automatically
which part of a source code is executed on the RISC processor and which part is executed
on the PACT XPP.

3. Presentation of a number of case studies demonstrating which results may be achieved by using
the proposed C compiler in cooperation with the proposed hardware framework.

The proposed hardware framework accelerates the XPP core in two respects. First, data throughput is
increased by raising the XPP's internal operating frequency into the range of the RISC's frequency.
This, however, means that the XPP falls into the same pit as all high-frequency processors: memory
accesses become very slow compared to processor-internal computations. This is why the use of a
cache is proposed. It eases the memory access problem for a large range of algorithms which are well
suited for execution on the XPP. The cache, as the second throughput-increasing feature, requires a
controller. Hence a programmable cache controller is introduced, which manages the cache contents and
feeds the XPP core. It decouples the XPP core computations from the data transfer so that, for
instance, data preload to a specific cache sector takes place while the XPP is operating on data located in
a different cache sector.

Another problem emerging with coupled RISC+XPP hardware concerns the RISC's multitasking
concept. It becomes necessary to interrupt computations on the XPP in order to perform a task
switch. Multitasking is supported by the proposed compiler as well as by the proposed hardware.
First, each XPP configuration is considered an uninterruptible entity. This means that the compiler,
which generates the configurations, takes care that the execution time of any configuration does not
exceed a predefined time slice. Second, the cache controller is concerned with the saving and restoring
of the XPP's state after an interrupt. The proposed cache concept minimizes the memory traffic for
interrupt handling and frequently even allows avoiding memory accesses altogether.

Finally, the proposed cache concept is based on a simple IRAM cell structure allowing for an easy
scalability of the hardware: extending the XPP cache size, for instance, requires not much more than
the duplication of IRAM cells.

The study proposes a compiler for a RISC+XPP system. The objective of the compiler is that real-world
applications, which are written in the C language, can be compiled for a RISC+XPP system.
The compiler removes the necessity of developing NML code for the XPP by hand. It is possible,
instead, to implement algorithms in the C language or to directly use existing C applications without
much adaptation to the XPP system. The proposed compiler includes three major components to
perform the compilation process for the XPP:

1. partitioning of the C source code into RISC and XPP parts,

2. transformations to optimize the code for the XPP and

3. generating NML code.

Finally the generated NML code is placed and routed for the XPP.

The partitioning component of the compiler decides which parts of an application code can be
executed on the XPP and which parts are executed on the RISC. Typical candidates for becoming XPP
code are loops with a large number of iterations whose loop bodies are dominated by arithmetic
operations. The remaining source code, including the data transfer code, is compiled for the RISC.

The proposed compiler transforms the XPP code such that it is optimized for NML code generation.
The transformations included in the compiler comprise a large number of loop transformations as well
as general code transformations. Together with data and code analysis the compiler restructures the
code so that it fits into the XPP array and that the final performance exceeds the pure RISC
performance. Finally the compiler generates NML code from the transformed program. The whole compilation
process is controlled by an optimization driver which selects the optimal order of transformations
based on the source code.

The case studies form a major part of the study. The selection of the examples follows the
guiding principle that each example stands for a set of typical real-world applications. For each
example the study demonstrates the work of the proposed compiler. First the code is partitioned. The code
transformations, which are done by the compiler, are shown and explained. Some examples require
minor source code transformations which must be performed by hand. The study argues that these
transformations are either too expensive or too specific to make sense to be included in the proposed
compiler. Dataflow graphs of the transformed codes are constructed for each example, which are used
by the compiler to generate the NML code. In addition the XPP resource usages are shown.

The case studies demonstrate that a compiler containing the proposed transformations can generate
efficient code from numerical applications for the XPP. This is possible because the compiler relies on
the features of the suggested hardware, like the cache controller.
2 Hardware



2.1 Design Parameter Changes 



Since the XPP core shall be integrated as a functional unit into a standard RISC core, some system 
parameters have to be reconsidered: 



RISC instructions of totally different type (Ld/St, ALU, Mul/Div/MAC, FPALU, FPMul, ...) are
executed in separate specialized functional units to increase the fraction of silicon that is busy on average.
Such functional unit separation has led to superscalar RISC designs, which exploit higher levels of
parallelism.

Each functional unit of a RISC core is highly pipelined to improve throughput. Pipelining overlaps the
execution of several instructions by splitting them into unrelated phases, which are executed in
different stages of the pipeline. Thus different stages of consecutive instructions can be executed in parallel
with each stage taking much less time to execute. This allows higher core frequencies.

Since the pipelines of all functional units are approximately subdivided into sub-operations of the
same size (execution time), these functional units / pipelines execute in a highly synchronous manner
with complex floating point pipelines being the exception.

Since the XPP core uses data flow computation, it is pipelined by design. However, a single
configuration usually implements a loop of the application, so the configuration remains active for many
cycles, unlike the instructions in every other functional unit, which typically execute for one or two
cycles at most. Therefore it is still worthwhile to consider the separation of several phases (e.g.: Ld / Ex
/ Store) of an XPP configuration (= XPP instruction) into several functional units to improve
concurrency via pipelining on this coarser scale. This also improves throughput and response time in
conjunction with multitasking operations and implementations of simultaneous multithreading (SMT).

The multi-cycle execution time also forbids a strongly synchronous execution scheme and rather leads
to an asynchronous scheme, as e.g. for floating point square root units. This in turn necessitates the
existence of explicit synchronization instructions.



As a functional unit, the XPP's operating frequency will either be half of the core frequency or equal
to the core frequency of the RISC. Almost every RISC core currently on the market exceeds its
memory bus frequency with its core frequency by a larger factor. Therefore caches are employed, forming
what is commonly called the memory hierarchy: Each layer of cache is larger but slower than its
predecessors.



2.1.1 Pipelining / Concurrency / Synchronicity



2.1.2 Core frequency / Memory hierarchy
This memory hierarchy does not help to speed up computations which shuffle large amounts of data,
with little or no data reuse. These computations are called "bounded by memory bandwidth". However
other types of computations with more data locality (another name for data reuse) gain performance as
long as they fit into one of the upper layers of the memory hierarchy. This is the class of applications
that gain the highest speedups when a memory hierarchy is introduced.

Classical vectorization can be used to transform memory-bounded algorithms, with a data set too big
to fit into the upper layers of the memory hierarchy. Rewriting the code to reuse smaller data sets
sooner exposes memory reuse on a smaller scale. As the new data set size is chosen to fit into the
caches of the memory hierarchy, the algorithm is not memory bounded any more, yielding significant
speed-ups.
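
The rewriting described above can be illustrated with a minimal sketch in C. It is not taken from the study: the function names, the two-pass computation and the block size BLOCK are assumptions chosen only for illustration. Both functions compute the same result, but the blocked variant applies both passes to one small block before moving on, so that the second pass finds its data still in an upper layer of the memory hierarchy.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical block size, assumed small enough to fit an upper cache layer. */
#define BLOCK 256

/* Unblocked: each pass streams the whole array. For a large n, the data
   touched by pass 1 has left the cache before pass 2 needs it again. */
long two_pass(int *a, size_t n)
{
    long sum = 0;
    assert(a != NULL || n == 0);
    for (size_t i = 0; i < n; i++) a[i] *= 2;     /* pass 1: scale      */
    for (size_t i = 0; i < n; i++) sum += a[i];   /* pass 2: accumulate */
    return sum;
}

/* Blocked: both passes run on one block before the next block is touched,
   so the reuse happens on a data set of at most BLOCK elements. */
long blocked(int *a, size_t n)
{
    long sum = 0;
    assert(a != NULL || n == 0);
    for (size_t base = 0; base < n; base += BLOCK) {
        size_t end = (n - base > BLOCK) ? base + BLOCK : n;
        for (size_t i = base; i < end; i++) a[i] *= 2;
        for (size_t i = base; i < end; i++) sum += a[i];
    }
    return sum;
}
```

In the context of this study, one block would correspond to the data set preloaded for a single configuration.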

2.1.3 Software / Multitasking Operating Systems

As the XPP is introduced into a RISC core, the changed environment (higher frequency and the
memory hierarchy) not only necessitates reconsideration of hardware design parameters, but also a
reevaluation of the software environment.

Memory Hierarchy 

The introduction of a memory hierarchy enhances the set of applications that can be implemented
efficiently. So far the XPP has mostly been used for algorithms that read their data sets in a linear manner,
applying some calculations in a pipelined fashion and writing the data back to memory. As long as all
of the computation fits into the XPP array, these algorithms are memory bounded. Typical applications
are filtering and audio signal processing in general.

But there is another set of algorithms that have even higher computational complexity and higher
memory bandwidth requirements. Examples are picture and video processing, where a second and
third dimension of data coherence opens up. This coherence is e.g. exploited by picture and video
compression algorithms that scan pictures in both dimensions to find similarities, even searching
consecutive pictures of a video stream for analogies. Naturally these algorithms have a much higher
algorithmic complexity as well as higher memory requirements. Yet they are data local, either by design or
they can be transformed to be, thus efficiently exploiting the memory hierarchy and the higher clock
frequencies of processors with memory hierarchies.

Multi Tasking

The introduction into a standard RISC core makes it necessary to understand and support the needs of
a multitasking operating system, as standard RISC processors are usually operated in multitasking
environments. With multitasking, the operating system switches the executed application on a regular
basis, thus simulating concurrent execution of several applications (tasks). To switch tasks, the
operating system has to save the state (e.g. the contents of all registers) of the running task and then reload
the state of another task. Hence it is necessary to determine what the state of the processor is, and to
keep it as small as possible to allow efficient context switches.

Modern microprocessors gain their performance from multiple specialized and deeply pipelined
functional units and deep memory hierarchies, enabling high core frequencies. But deep memory
hierarchies mean that there is a high penalty for cache misses due to the difference between core and
memory frequency. Many core cycles pass until the values are finally available from memory. Deep
pipelines incur pipeline stalls due to data dependencies as well as branch penalties for mispredicted
conditional branches. Specialized functional units like floating point units idle for integer-only programs.
For these reasons, average functional unit utilization is much too low.
The newest development with RISC processors, Simultaneous MultiThreading (SMT), adds hardware
support for a finer granularity (instruction / functional unit level) switching of tasks, exposing more
than one independent instruction stream to be executed. Thus, whenever one instruction stream stalls
or doesn't utilize all functional units, the other one can jump in. This improves functional unit
utilization for today's processors.

With SMT, the task (process) switching is done in hardware, so the processor state has to be
duplicated in hardware. So again it is most efficient to keep the state as small as possible. For the
combination of the PACT XPP and a standard RISC processor, SMT is very beneficial, since the XPP
configurations execute longer than the average RISC instruction. Thus another task can utilize the other
functional units while a configuration is running. On the other side, not every task will utilize the XPP, so
while one such non-XPP task is running, another one will be able to use the XPP core.



2.2 Communication Between the RISC Core and the 
XPP Core. 

The following section introduces several possible hardware implementations for accessing memory.

2.2.1 Streaming

Since streaming can only support (number_of_IO_ports * width_of_IO_port) bits per cycle, it is only
well suited for small XPP arrays with heavily pipelined configurations that feature few inputs and
outputs. As the pipelines take a long time to fill and empty while the running time of a configuration is
limited (as described under "context switches"), this type of communication does not scale well to
bigger XPP arrays and XPP frequencies near the RISC core frequency.
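
The streaming limit stated above can be made concrete with a small calculation. The following C sketch is only an illustration: the function names are hypothetical, and the port count and port width used in the usage note are assumed values, not figures from the study.

```c
#include <assert.h>

/* Peak streaming bandwidth in bits per cycle:
   number_of_IO_ports * width_of_IO_port. */
unsigned stream_bits_per_cycle(unsigned io_ports, unsigned port_width_bits)
{
    return io_ports * port_width_bits;
}

/* Minimum number of cycles needed to stream total_bits through the IO
   ports (rounded up). Pipeline fill and drain time is not included. */
unsigned long stream_cycles(unsigned long total_bits,
                            unsigned io_ports, unsigned port_width_bits)
{
    unsigned bw = stream_bits_per_cycle(io_ports, port_width_bits);
    assert(bw > 0);   /* at least one port of non-zero width is required */
    return (total_bits + bw - 1) / bw;
}
```

For example, with an assumed 4 ports of 32 bits each, the array can take in at most 128 bits per cycle, so 1000 words of 32 bits need at least 250 cycles regardless of how many PAEs are available to process them.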

■ Streaming from the RISC core 

In this setup, the RISC supplies the XPP array with the streaming data. Since the RISC core
has to execute several instructions to compute addresses and load an item from memory, this
setup is only suited if the XPP core is reading data with a frequency much lower than the
RISC core frequency.

■ Streaming via DMA

In this mode the RISC core only initializes a DMA channel which then supplies the data items
to the streaming port of the XPP core.

2.2.2 Shared Memory (Main Memory)

In this configuration the XPP array configuration uses a number of PAEs to generate an address that is
used to access main memory through the IO ports. As the number of IO ports is very limited this
approach suffers from the same limitations as the previous one, although for larger XPP arrays there is
less impact of using PAEs for address generation. However this approach is still useful for loading
values from very sparse vectors.



2.2.3 Shared Memory (IRAM)

This data access mechanism uses the IRAM elements to store data for local computations. The IRAMs 
can either be viewed as vector registers or as local copies of main memory. 
There are several ways to fill the IRAMs with data.

1. The IRAMs are loaded in advance by a separate configuration using streaming.

This method can be implemented with the current XPP architecture. The IRAMs act as vector
registers. As explicated above, this will limit the performance of the XPP array, especially as
the IRAMs will always be part of the externally visible state and hence must be saved and
restored on context switches.

2. The IRAMs can be loaded in advance by separate load-instructions.

This is similar to the first method. Load-instructions are implemented in hardware which load
the data into the IRAMs. The load-instructions can be viewed as a hard-coded
load-configuration. Therefore configuration reloads are reduced. Additionally, the special
load-instructions may use a wider interface to the memory hierarchy. Therefore a more efficient
method than streaming can be used.

3. The IRAMs can be loaded by a "burst preload from memory" instruction of the cache
controller. No configuration or load-instruction is needed on the XPP. The IRAM load is
implemented in the cache controller and triggered by the RISC processor. But the IRAMs still act as
vector registers and are therefore included in the externally visible state.

4. The best mode however is a combination of the previous solutions with the extension of a
cache:

A preload instruction maps a specific memory area defined by starting address and size to an
IRAM. This triggers a (delayed, low priority) burst load from the memory hierarchy (cache).
After all IRAMs are mapped, the next configuration can be activated. The activation incurs a
wait until all burst loads are completed. However, if the preload instructions are issued long
enough in advance and no interrupt or task switch destroys cache locality, the wait will not
consume any time.

To specify a memory block as output-only IRAM, a "preload clean" instruction is used, which
avoids loading data from memory. The "preload clean" instruction just indicates the IRAM for
write back.

A synchronization instruction is needed to make sure that the content of a specific memory
area, which is cached in IRAM, is written back to the memory hierarchy. This can be done
globally (full write back), or selectively by specifying the memory area which will be
accessed.

2.3 State of the XPP Core

As described in the previous section, the size of the state is crucial for the efficiency of context
switches. However, although the size of the state is fixed for the XPP core, it depends on the
declaration of the various state elements whether they have to be saved or not.

The state of the XPP core can be classified as

1 Read only (instruction data)

■ configuration data, consisting of PAE configuration and routing configuration data

2 Read-Write

■ the contents of the data registers and latches of the PAEs, which are driven onto the busses

■ the contents of the IRAM elements
2.3.1 Limiting Memory Traffic



There are several possibilities to limit the amount of memory traffic during context switches. 

Do not save read-only data 

This avoids storing configuration data, since configuration data is read only. The current configuration
is simply overwritten by the new one.

Save less data

If a configuration is defined to be uninterruptible (non pre-emptive), all of the local state on the busses
and in the PAEs can be declared as scratch. This means that every configuration gets its input data
from the IRAMs and writes its output data to the IRAMs. So after the configuration has finished all
information in the PAEs and on the buses is redundant or invalid and does not have to be saved.

Save modified data only

To reduce the amount of R/W data which has to be saved, we need to keep track of the modification
state of the different entities. This incurs a silicon area penalty for the additional "dirty" bits.

Use caching to reduce the memory traffic 

The configuration manager handles manual preloading of configurations. Preloading will help in
parallelizing the memory transfers with other computations during the task switch. This cache can also
reduce the memory traffic for frequent context switches, provided that a Least Recently Used (LRU)
replacement strategy is implemented in addition to the preload mechanism.

The IRAMs can be defined to be local cache copies of main memory as proposed as fourth method in
section 2.2.3. Then each IRAM is associated with a starting address and modification state
information. The IRAM memory cells are replicated. An IRAM PAE contains an IRAM block with multiple
IRAM instances. Only the starting addresses of the IRAMs have to be saved and restored as context.
The starting addresses for the IRAMs of the current configuration select the IRAM instances with
identical addresses to be used.

If no address tag of an IRAM instance matches the address of the newly loaded context, the
corresponding memory area is loaded to an empty IRAM instance.

If no empty IRAM instance is available, a clean (unmodified) instance is declared empty (and hence
must be reloaded later on).

If no clean IRAM instance is available, a modified (dirty) instance is cleaned by writing its data back
to main memory. This adds a certain delay for the write back.

This delay can be avoided if a separate state machine (cache controller) tries to clean inactive IRAM
instances by using unused memory cycles to write back the IRAM instances' contents.
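
The selection order described above (matching tag, then empty, then clean, then dirty instance) can be sketched in C as follows. This is a simplified software model under stated assumptions, not the hardware design: the type and function names are hypothetical, an instance marked as in use by a running configuration is assumed to be untouchable, and the actual load and write back are left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one IRAM instance inside an IRAM block. */
typedef enum { IRAM_EMPTY, IRAM_CLEAN, IRAM_DIRTY, IRAM_IN_USE } iram_state;

typedef struct {
    uintptr_t  start;   /* address tag: starting address of the mapped area */
    iram_state state;
} iram_instance;

static int find_state(const iram_instance *inst, size_t n, iram_state s)
{
    for (size_t i = 0; i < n; i++)
        if (inst[i].state == s) return (int)i;
    return -1;
}

/* Returns the index of the instance to use for addr, or -1 if every
   instance is in use. *needs_load is set when the memory area must be
   (re)loaded, *needs_writeback when a dirty victim must be written back
   to memory first. */
int select_instance(const iram_instance *inst, size_t n, uintptr_t addr,
                    int *needs_load, int *needs_writeback)
{
    assert(inst != NULL && needs_load != NULL && needs_writeback != NULL);
    *needs_load = 0;
    *needs_writeback = 0;

    /* 1. An instance whose tag matches is reused without memory traffic. */
    for (size_t i = 0; i < n; i++)
        if ((inst[i].state == IRAM_CLEAN || inst[i].state == IRAM_DIRTY) &&
            inst[i].start == addr)
            return (int)i;

    /* 2. Otherwise prefer an empty instance, 3. then a clean one. */
    int victim = find_state(inst, n, IRAM_EMPTY);
    if (victim < 0) victim = find_state(inst, n, IRAM_CLEAN);

    /* 4. As a last resort a dirty instance is cleaned by a write back. */
    if (victim < 0) {
        victim = find_state(inst, n, IRAM_DIRTY);
        if (victim >= 0) *needs_writeback = 1;
    }
    if (victim >= 0) *needs_load = 1;
    return victim;
}
```

The background cleaning by the cache controller mentioned above would, in this model, turn dirty instances into clean ones ahead of time, so that the fourth case and its write-back delay occur less often.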



2.4 Context Switches 



Usually a processor is viewed as executing a single stream of instructions. But today's multitasking
operating systems support hundreds of tasks being executed on a single processor. This is achieved by
switching contexts, where all, or at least the most relevant parts of the processor state which belong to
the current task (the task's context) are exchanged with the state of another task that will be executed
next.

There are three types of context switches: switching of virtual processors with simultaneous
multithreading (SMT, also known as HyperThreading), execution of an Interrupt Service Routine (ISR) and
a Task Switch.



2.4.1 SMT Virtual Processor Switch

This type of context switch is executed without software interaction, totally in hardware. Instructions
of several instruction streams are merged into a single instruction stream to increase instruction level
parallelism and improve functional unit utilization. Hence the processor state cannot be stored to and
reloaded from memory between instructions from different instruction streams: Imagine the worst case
of alternating instructions from two streams and the hundreds to thousands of cycles needed to write the
processor state to memory and read in another state.

Hence hardware designers have to replicate the internal state for every virtual processor. Every
instruction is executed within the context (on the state) of the virtual processor whose program counter
was used to fetch the instruction. By replicating the state, only the multiplexers, which have to be
inserted to select one of the different states, have to be switched.

Thus the size of the state also increases the silicon area needed to implement SMT, so the size of the
state is crucial for many design decisions.

2.4.2 Interrupt Service Routine 

This type of context switch is handled partially by hardware and partially by software. All of the state
modified by the ISR has to be saved on entry and must be restored on exit.

The part of the state which is destroyed by the jump to the ISR is saved by hardware (e.g. the
program counter). It is the ISR's responsibility to save and restore the state of all other resources that are
actually used within the ISR.

The more state information to be saved, the slower the interrupt response time will be and the greater 
the performance impact will be if external events trigger interrupts at a high rate. 

The execution model of the instructions will also affect the tradeoff between short interrupt latencies
and maximum throughput: Throughput is maximized if the instructions in the pipeline are finished,
and the instructions of the ISR are chained. This adversely affects the interrupt latency. If, however,
the instructions are abandoned (pre-empted) in favor of a short interrupt latency, they must be fetched
again later, which affects throughput. The third possibility would be to save the internal state of the
instructions within the pipeline, but this requires too much hardware effort. Usually this is not done.

2.4.3 Task Switch 

This type of context switch is executed totally in software. All of a task's context (state) has to be
saved to memory, and the context of the new task has to be reloaded. Since tasks are usually allowed
to use all of the processor's resources to achieve top performance, all of the processor state has to be
saved and restored. If the amount of state is excessive, the rate of context switches must be decreased
by less frequent rescheduling, or a severe throughput degradation will result, as most of the time will
be spent in saving and restoring task contexts. This in turn increases the response time for the tasks.
2.5 A Load Store Architecture



We propose an XPP integration as an asynchronously pipelined functional unit for the RISC. We
further propose an explicitly preloaded cache for the IRAMs, on top of the memory hierarchy existing
within the RISC (as proposed as fourth method in section 2.2.3). Additionally a de-centralized
explicitly preloaded configuration cache within the PAE array is employed to support preloading of
configurations and fast switching between configurations.

Since the IRAM content is an explicitly preloaded memory area, a virtually unlimited number of such
IRAMs can be used. They are identified by their memory address and their size. The IRAM content is
explicitly preloaded by the application. Caching will increase performance by reusing data from the
memory hierarchy. The cached operation also eliminates the need for explicit store instructions; they
are handled implicitly by cache write back operations but can also be forced to synchronize with the
RISC.

The pipeline stages of the XPP functional unit are Load, Execute and Write Back (Store). The store is
executed delayed as a cache write back. The pipeline stages execute in an asynchronous fashion, thus
hiding the variable delays from the cache preloads and the PAE array.

The XPP functional unit is decoupled from the RISC by a FIFO, which is fed with the XPP instructions.
At the head of this FIFO, the XPP PAE consumes and executes the configurations and the preloaded
IRAMs. Synchronization of the XPP and the RISC is done explicitly by a synchronization instruction.

Instructions 

In the following we define the instruction formats needed for the proposed architecture. We use a C
style prototype definition to specify data types. All instructions, except the XPPSync instruction,
execute asynchronously. The XPPSync instruction can be used to force synchronization.

XPPPreloadConfig (void *ConfigurationStartAddress)

The configuration is added to the preload FIFO to be loaded into the configuration cache within the
PAE array.
Note that speculative preloads are possible, since successive preload commands overwrite the
previous.

The parameter is a pointer register of the RISC pointer register file. The size is implicitly contained in
the configuration.

XPPPreload (int IRAM, void *StartAddress, int Size)
XPPPreloadClean (int IRAM, void *StartAddress, int Size)

These instructions specify the contents of the IRAM for the next configuration execution. In fact, the
memory area is added to the preload FIFO to be loaded into the specified IRAM.

The first parameter is the IRAM number. This is an immediate (constant) value.

The second parameter is a pointer to the starting address. This parameter is provided in a pointer
register of the RISC pointer register file.

The third parameter is the size in units of 32 bit words. This is an integer value. It resides in a
general-purpose register of the RISC's integer register file.

The first variant actually preloads the data from memory.

The second variant is for write-only accesses. It skips the loading operation. Thus no cache misses can
occur for this IRAM. Only the address and size are defined. They are obviously needed for the write
back operation of the IRAM cache.
Note that speculative preloads are possible, since successive preload commands to the same IRAM
overwrite each other (if no configuration is executed in between). Thus only the last preload command
is actually effective when the configuration is executed.

XPPExecute ()

This instruction executes the last preloaded configuration with the last preloaded IRAM contents.
Actually a configuration start command is issued to the FIFO. Then the FIFO is advanced; this means that
further preload commands will specify the next configuration or parameters for the next configuration.
Whenever a configuration finishes, the next one is consumed from the head of the FIFO, if its start
command has already been issued.

XPPSync (void *StartAddress, int Size)

This instruction forces write back operations for all IRAMs that overlap the given memory area. If
overlapping IRAMs are still in use by a configuration or preloaded to be used, this operation will
block. Giving an address of NULL (zero) and a size of MAX_INT (bigger than the actual memory),
this instruction can also be used to wait until all issued configurations finish.
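
The overlap condition that decides which IRAMs an XPPSync has to write back can be sketched in C. This is only an illustrative model: the names ranges_overlap, iram_map and sync_writebacks are hypothetical, sizes are assumed to be given in bytes, and the blocking behaviour for IRAMs that are still in use is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* True if [a, a + a_size) and [b, b + b_size) share at least one byte.
   Written with subtractions so that a very large size cannot overflow. */
int ranges_overlap(uintptr_t a, size_t a_size, uintptr_t b, size_t b_size)
{
    if (a_size == 0 || b_size == 0) return 0;
    return (a < b) ? (b - a < a_size) : (a - b < b_size);
}

/* Hypothetical record of one IRAM mapping held by the cache controller. */
typedef struct {
    uintptr_t start;
    size_t    size;
    int       dirty;
} iram_map;

/* Counts the IRAMs a sync over [start, start + size) must write back:
   those that are dirty and overlap the given memory area. */
size_t sync_writebacks(const iram_map *m, size_t n, uintptr_t start, size_t size)
{
    size_t count = 0;
    assert(m != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (m[i].dirty && ranges_overlap(m[i].start, m[i].size, start, size))
            count++;
    return count;
}
```

With a start address of zero and a very large size, every dirty IRAM overlaps the area, which corresponds to the full write back described above.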
2.5.1 A Basic Implementation
Figure 1: Memory interface
The XPP core shares the memory hierarchy with the RISC core using a special cache controller.



XPPPreloadConfig( XppCfg_foo );

for (int i = 0; i < 1000; i++) {
    XPPPreload( 2, &a[i*30], 30 );
    XPPPreload( 0, &b[i*200], 200 );
    XPPPreloadClean( 5, &c[i*10], 10 );
    XPPExecute( );
    /*
     * Other RISC computations ...
     * In the meanwhile, the burst preloads and
     * the previous configuration are running.
     * The new configuration is executed as soon
     * as the preloads and the previous
     * configuration are finished.
     * New burst preloads can be issued
     * according to the FIFO length.
     */
}

Note: in all places where constants are used,
the value should actually come from a register.



Legend: per-thread state resource; volatile resource (write back if dirty); volatile read-only resource




Figure 2: IRAM & configuration cache controller data structures and usage example
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The preload-FIFOs in the above figure contain the addressses and sizes for aheady issued IRAM pre- 
loads, exposing them to the XPP cache controller. The FIFOs have to be duplicated for every virtual 
processor in an SMT environment Tag is the typical tag for a cache line containing starting address, 
size and state {empty I clean I dirty I in-use). The additional in-use state signals usage by the current 
configuration. The cache controller cannot manipulate these IRAM instances. 

The execute configuration command advances all preload FIFOs, copying the old state to the newly
created entry. This way the following preloads replace the previously used IRAMs and configurations.
If no preload is issued for an IRAM before the configuration is executed, the preload of the previous
configuration is retained. Therefore it is not necessary to repeat identical preloads for an IRAM in
consecutive configurations.
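
The FIFO behaviour described above can be sketched in C as follows. This is only a minimal software model under stated assumptions, not the hardware design: the names PreloadFifo, fifo_preload and fifo_execute, as well as the constants NUM_IRAMS and FIFO_LEN, are hypothetical and chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_IRAMS 16   /* assumption: one preload FIFO per IRAM port */
#define FIFO_LEN   4   /* assumption: illustrative FIFO depth */

typedef struct {
    uintptr_t addr;    /* start address of the preloaded memory area */
    uint32_t  size;    /* size of the area in words */
} PreloadEntry;

typedef struct {
    PreloadEntry entry[NUM_IRAMS][FIFO_LEN];
    unsigned     wr;   /* entry currently being filled for the next configuration */
} PreloadFifo;

/* Record a preload for one IRAM in the entry of the upcoming configuration. */
void fifo_preload(PreloadFifo *f, unsigned iram, uintptr_t addr, uint32_t size)
{
    assert(iram < NUM_IRAMS);
    f->entry[iram][f->wr].addr = addr;
    f->entry[iram][f->wr].size = size;
}

/* Execute: advance every FIFO and copy the old state into the new entry,
 * so that an IRAM without a new preload retains the previous one. */
void fifo_execute(PreloadFifo *f)
{
    unsigned next = (f->wr + 1u) % FIFO_LEN;
    for (unsigned i = 0; i < NUM_IRAMS; ++i)
        f->entry[i][next] = f->entry[i][f->wr];
    f->wr = next;
}
```

In this model a preload issued after fifo_execute overwrites only the new entry, while the entry belonging to the configuration already issued stays untouched.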




Each configuration's execute command has to be delayed (stalled) until all necessary preloads are
finished, either explicitly by the use of a synchronization command or implicitly by the cache
controller. Hence the cache controller (XPP Ld/St unit) has to handle the synchronization and execute
commands as well, actually starting the configuration as soon as all data is ready. After the termination
of the configuration, dirty IRAMs are written back to memory as soon as possible, if their content is not
reused in the same IRAM. Therefore the XPP PAE array and the XPP cache controller can be seen as a
single unit since they do not have different instruction streams: rather, the cache controller can be seen
as the configuration fetch (CF), operand fetch (OF) (IRAM preload) and write back (WB) stage of the
XPP pipeline, also triggering the execute stage (EX) (PAE array).

Due to the long latencies, and their non-predictability (cache misses, variable length configurations),
the stages can be overlapped several configurations wide using the configuration and data preload
FIFO (= pipeline) for loose coupling. So if a configuration is executing and the data for the next has
already been preloaded, the data for the next but one configuration is preloaded. These preloads can be
speculative; the amount of speculation is the compiler's trade-off. The reasonable length of the preload
FIFO can be several configurations; it is limited by diminishing returns, algorithm properties, the
compiler's ability to schedule preloads early and by silicon usage due to the IRAM duplication factor,
which has to be at least as big as the FIFO length. Due to this loosely coupled operation, the
interlocking - to avoid data hazards between IRAMs - cannot be done optimally by software (scheduling),
but has to be enforced by hardware (hardware interlocking). Hence the XPP cache controller and the
XPP PAE array can be seen as separate but not totally independent functional units.



WO 2004/015561    PCT/EP2003/008080

[State diagram. Labels recoverable from the drawing: wait; configuration execute; finished;
preload needed urgently; write back; all write backs blocked by in-use IRAMs; all preloads blocked
by dirty or in-use IRAMs; no clean IRAM instance; discard LRU clean IRAM; no empty IRAM instance]
Figure 4: State transition diagram for the XPP cache controller 

The XPP cache controller has several tasks. These are depicted as states in the above diagram. State
transitions take place along the edges between states, whenever the condition for the edge is true. As
soon as the condition is not true any more, the reverse state transition takes place. The activities for the
states are as follows:

At the lowest priority, the XPP cache controller has to fulfill already issued preload commands, while
writing back dirty IRAMs as soon as possible.

As soon as a configuration finishes, the next configuration can be started. This is a more urgent task
than write backs or future preloads. To be able to do that, all associated yet unsatisfied preloads have
to be finished first. Thus they are preloaded with the high priority inherited from the execute state.



A preload in turn can be blocked by an overlapping in-use or dirty IRAM instance in a different block
or by the lack of empty IRAM instances in the target IRAM block. The former can be resolved by
waiting for the configuration to finish and / or by a write back. To resolve the latter, the least recently
used clean IRAM can be discarded, thus becoming empty. If no empty or clean IRAM instance exists,
a dirty one has to be written back to the memory hierarchy. It cannot occur that no empty, clean or
dirty IRAM instances exist, since only one instance can be in-use and there should be more than one
instance in an IRAM block - otherwise no

In an SMT environment the load FIFOs have to be replicated for every virtual processor. The pipelines
of the functional units are fed from the shared fetch / reorder / issue stage. All functional units execute
in parallel. Different units can execute instructions of different virtual processors.

So we get the following design parameters with their smallest initial value:

IRAM length: 128 words
   The longer the IRAM length, the longer the running time of the configuration and the less
   influence the pipeline startup has.

FIFO length: 1
   This parameter helps to hide cache misses during preloading: The longer the FIFO length, the
   less disruptive is a series of cache misses for a single configuration.

IRAM duplication factor: (pipeline stages + caching factor) * virtual processors: 3
   Pipeline stages is the number of pipeline stages LD/EX/WB plus one for every FIFO stage
   above one: 3
   Caching factor is the number of IRAM duplicates available for caching: 0
   Virtual processors is the number of virtual processors with SMT: 1

The size of the state of a virtual processor is mainly dependent on the FIFO length. It is:

   FIFO length * #IRAM ports * (32 bit (Address) + 32 bit (Size))

This has to be replicated for every virtual processor.

The total size of memory used for the IRAMs is:

   #IRAM ports * IRAM duplication factor * IRAM length * 32 bit

A first implementation will probably keep close to the above-stated minimum parameters, using a
FIFO length of one, an IRAM duplication factor of four, an IRAM length of 128 and no simultaneous
multithreading.
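
The two size formulas above can be evaluated with a few lines of C. This is a small sketch for checking the arithmetic; the function names are hypothetical, and the figure of 16 IRAM ports used in the note below is an assumption made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* State of one virtual processor in bits:
 * FIFO length * #IRAM ports * (32 bit address + 32 bit size). */
uint64_t vp_state_bits(uint32_t fifo_len, uint32_t iram_ports)
{
    return (uint64_t)fifo_len * iram_ports * (32u + 32u);
}

/* IRAM duplication factor = (pipeline stages + caching factor) * virtual processors,
 * with pipeline stages = 3 (LD/EX/WB) plus one for every FIFO stage above one. */
uint32_t iram_dup_factor(uint32_t fifo_len, uint32_t caching_factor, uint32_t virtual_procs)
{
    assert(fifo_len >= 1);
    uint32_t pipeline_stages = 3u + (fifo_len - 1u);
    return (pipeline_stages + caching_factor) * virtual_procs;
}

/* Total IRAM memory in bits:
 * #IRAM ports * IRAM duplication factor * IRAM length * 32 bit. */
uint64_t iram_total_bits(uint32_t iram_ports, uint32_t dup_factor, uint32_t iram_len_words)
{
    return (uint64_t)iram_ports * dup_factor * iram_len_words * 32u;
}
```

Assuming 16 IRAM ports, the minimum parameters give 1024 bits of state per virtual processor and 196608 bits (24 KiB) of IRAM memory; with a duplication factor of four the IRAM memory grows to 262144 bits (32 KiB).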

2.5.2 Implementation Improvements 

Write Pointer 

To further decrease the penalty for unloaded IRAMs, a simple write pointer may be used per IRAM,
which keeps track of the last address already in the IRAM. Thus no stall is required, unless an access
beyond this write pointer is encountered. This is especially useful if all IRAMs have to be reloaded
after a task switch: The delay to the configuration start can be much shorter, especially if the preload
engine of the cache controller chooses the blocking IRAM next whenever several IRAMs need further
loading.
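
A minimal sketch of the stall condition, assuming the write pointer is kept as the count of words already loaded (the structure and function names are hypothetical):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t length;     /* number of words the preload will eventually deliver */
    uint32_t write_ptr;  /* words already loaded; indices below this are valid */
} IramLoadState;

/* An access from the PAE array has to stall only if it touches a word
 * that the running preload has not delivered yet. */
bool iram_access_must_stall(const IramLoadState *s, uint32_t index)
{
    assert(index < s->length);
    return index >= s->write_ptr;
}
```

With 40 of 128 words loaded, accesses to indices 0 to 39 proceed at once and only an access to index 40 or higher waits.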

Longer FIFOs

The frequency at the bottom of the memory hierarchy (main memory) cannot be raised to the same
extent as the frequency of the CPU core. To increase the concurrency between the RISC core and the
PACT XPP core, the prefetch FIFOs in the above drawing can be extended. Thus the IRAM contents
for several configurations can be preloaded, like the configurations themselves. A simple convention
makes clear which IRAM preloads belong to which configuration: the configuration execute switches
to the next configuration context. This can be accomplished by advancing the FIFO write pointer with
every configuration execute, while leaving it unchanged after every preload. Unassigned IRAM FIFO
entries keep their contents from the previous configuration, so every succeeding configuration will use
the preceding configuration's IRAMx if no different IRAMx was preloaded.

If none of the memory areas to be copied to IRAMs is in any cache, extending the FIFOs does not
help, as the memory is the bottleneck. So the cache size should be adjusted together with the FIFO
length.

A drawback of extending the FIFO length is the increased likelihood that the IRAM content written by
an earlier configuration is reused by a later one in another IRAM. A cache coherence protocol can
clear the situation. Note however that the situation can be resolved more easily: If an overlap between
any new IRAM area and a currently dirty IRAM content of another IRAM bank is detected, the new
IRAM is simply not loaded until the write back of the changed IRAM has finished. Thus the execution
of the new configuration is delayed until the correct data is available.
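
The overlap test that delays such a preload can be sketched as follows. This is an illustrative model only: the type IramTag and both function names are hypothetical, addresses are counted in words, and in-use IRAMs are treated like dirty ones as an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum { IRAM_EMPTY, IRAM_CLEAN, IRAM_DIRTY, IRAM_IN_USE } IramState;

typedef struct {
    uint32_t  start;   /* start address of the cached area, in words */
    uint32_t  size;    /* size of the cached area, in words */
    IramState state;
} IramTag;

/* True if the half-open ranges [a, a+an) and [b, b+bn) share at least one word. */
bool ranges_overlap(uint32_t a, uint32_t an, uint32_t b, uint32_t bn)
{
    return an != 0 && bn != 0 &&
           (uint64_t)a < (uint64_t)b + bn &&
           (uint64_t)b < (uint64_t)a + an;
}

/* A preload into IRAM `target` has to wait while a dirty or in-use IRAM
 * other than the target overlaps the requested memory area. */
bool preload_must_wait(const IramTag *tags, int n, int target,
                       uint32_t start, uint32_t size)
{
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        if (i == target)
            continue;   /* only other IRAMs are checked here */
        bool busy = tags[i].state == IRAM_DIRTY || tags[i].state == IRAM_IN_USE;
        if (busy && ranges_overlap(tags[i].start, tags[i].size, start, size))
            return true;
    }
    return false;
}
```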

For a short (single entry) FIFO, an overlap is extremely unlikely, since the compiler will usually leave
the output IRAM contents of the previous configuration in place for the next configuration to skip the
preload. The compiler does so using a coalescing algorithm for the IRAMs / vector registers. The
coalescing algorithm is the same as used for register coalescing in register allocation.



Read Only IRAMs

Whenever the memory that is used by the executing configuration is the source of a preload command
for another IRAM, an XPP pipeline stall occurs: The preload can only be started when the
configuration has finished, and - if the content was modified - the memory content has been written to the
cache. To decrease the number of pipeline stalls, it is beneficial to add an additional read-only IRAM
state. If the IRAM is read only, the content cannot be changed, and the preload of the data to the other
IRAM can proceed without delay. This requires an extension to the preload instructions: The
XppPreload and the XppPreloadClean instruction formats can be combined to a single instruction format that
has two additional bits, stating whether the IRAM will be read and/or written. To support debugging,
violations should be checked at the IRAM ports, raising an exception when needed.

2.5.3 Support for Data Distribution and Data Reorganization

The IRAMs are block-oriented structures, which can be read in any order by the PAE array. However,
the address generation adds complexity, reducing the number of PAEs available for the actual
computation. So it is best if the IRAMs are accessed in linear order. The memory hierarchy is block oriented
as well, further encouraging linear access patterns in the code to avoid cache misses.

As the IRAM read ports limit the bandwidth between each IRAM and the PAE array to one word read
per cycle, it can be beneficial to distribute the data over several IRAMs to remove this bottleneck. The
top of the memory hierarchy is the source of the data, so the amount of cache misses never increases
when the access pattern is changed, as long as the data locality is not destroyed.

Many algorithms access memory in linear order by definition to utilize block reading and simple
address calculations. In most other cases and in the cases where loop tiling is needed to increase the data
bandwidth between the IRAMs and the PAE array, the code can be transformed in a way that data is
accessed in optimal order. In many of the remaining cases, the compiler can modify the access pattern
by data layout rearrangements (e.g. array merging), so that finally the data is accessed in the desired
pattern. If none of these optimizations can be used because of dependencies, or because the data layout
is fixed, there are still two possibilities to improve performance:

Data Duplication

Data is duplicated in several IRAMs. This circumvents the IRAM read port bottleneck, allowing
several data items to be read from the input every cycle.

Several options are possible with a common drawback: data duplication can only be applied to input
data; output IRAMs obviously cannot have overlapping address ranges.

o Using several IRAM preload commands specifying just different target IRAMs:

   This way cache misses occur only for the first preload. All other preloads will take place without
   cache misses - only the time to transfer the data from the top of the memory hierarchy to the
   IRAMs is needed for every additional load. This is only beneficial if the cache misses plus the
   additional transfer times do not exceed the execution time for the configuration.

o Using an IRAM preload instruction to load multiple IRAMs concurrently:

   As identical data is needed in several IRAMs, they can be loaded concurrently by writing the
   same values to all of them. This amounts to finding a clean IRAM instance for every target
   IRAM, connecting them all to the bus and writing the data to the bus.
   The problem with this instruction is that it requires a bigger immediate field for the destination
   (16 bits instead of 4 for the XPP 64). Accordingly this instruction format grows at a higher rate
   when the number of IRAMs is increased for bigger XPP arrays.

   The interface of this instruction looks like:

   XPPPreloadMultiple (int IRAMS, void *StartAddress, int Size)

   This instruction behaves as the XPPPreload / XPPPreloadClean instructions with the exception
   of the first parameter:

   The first parameter is IRAMS. This is an immediate (constant) value. The value is a bitmap - for
   every bit in the bitmap, the IRAM with that number is a target for the load operation.

   There is no "clean" version, since data duplication is applicable for read data only.
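
The effect of the bitmap parameter can be illustrated with a small software model. This is a sketch under assumptions, not the instruction itself: preload_multiple_model, the array iram and the constants NUM_IRAMS and IRAM_WORDS are hypothetical stand-ins, with sixteen IRAMs of 128 words assumed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_IRAMS   16   /* assumption: sixteen IRAMs, matching a 16-bit bitmap */
#define IRAM_WORDS 128   /* assumption: IRAM length in 32-bit words */

static uint32_t iram[NUM_IRAMS][IRAM_WORDS];   /* software stand-in for the IRAMs */

/* Model of the multiple preload: bit n of `irams` selects IRAM n as a target,
 * and every selected IRAM receives the same `size` words.
 * Returns the number of IRAMs written. */
int preload_multiple_model(uint16_t irams, const uint32_t *start, int size)
{
    assert(size >= 0 && size <= IRAM_WORDS);
    int targets = 0;
    for (int n = 0; n < NUM_IRAMS; ++n) {
        if (irams & (1u << n)) {
            memcpy(iram[n], start, (size_t)size * sizeof(uint32_t));
            ++targets;
        }
    }
    return targets;
}
```

A bitmap of 0x0005, for example, selects IRAM 0 and IRAM 2 and leaves all other IRAMs unchanged.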

Data Reordering

Data reordering changes the access pattern to the data only. It does not change the amount of memory
that is read. Thus the number of cache misses stays the same.

o Adding additional functionality to the hardware:

   o Adding a vector stride to the preload instruction.

      A stride (displacement between two elements in memory) is used in vector load
      operations to load e.g. a column of a matrix into a vector register.

      This is still a linear access pattern. It can be implemented in hardware by giving a
      stride to the preload instruction and adding the stride to the IRAM identification state.
      One problem with this instruction is that the number of possible cache misses per
      IRAM load rises: In the worst case it can be one cache miss per loaded value, if the
      stride is equal to the cache line size and none of the data is in the cache.
      But as already stated: the total number of misses stays the same - just the distribution
      changes. Still this is an undesirable effect.

      The other problem is the complexity of the implementation and a possibly limited
      throughput, as the data paths between the layers of the memory hierarchy are
      optimized for block transfers. Transferring non-contiguous words will not use wide busses
      in an optimal fashion.

      The interface of the instruction looks like:

      XPPPreloadStride (int IRAM, void *StartAddress, int Size, int Stride)
      XPPPreloadCleanStride (int IRAM, void *StartAddress, int Size, int Stride)

      This instruction behaves as the XPPPreload / XPPPreloadClean instructions with the
      addition of another parameter:
      The fourth parameter is the vector stride. This is an immediate (constant) value. It tells
      the cache controller to load only every n-th value to the specified IRAM.
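
The strided access can be modelled in C as below. This is a sketch that assumes the stride is counted in words and that the IRAM is a plain word array; preload_stride_model is a hypothetical name and not part of the instruction set described here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of a strided preload: fill `size` words of the IRAM,
 * taking every `stride`-th word beginning at `start`. */
void preload_stride_model(uint32_t *iram, const uint32_t *start, int size, int stride)
{
    assert(size >= 0 && stride >= 1);
    for (int k = 0; k < size; ++k)
        iram[k] = start[(size_t)k * (size_t)stride];
}
```

For a row-major matrix with four columns, a start address pointing at element 1 of the first row together with a stride of 4 loads column 1 into the IRAM.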



o Reordering the data at run time, introducing temporary copies:

   o On the RISC:

      The RISC can copy data at a maximum rate of one word per cycle for simple address
      computations and at a somewhat lower rate for more complex ones.

      With a memory hierarchy, the sources will be read from memory (or cache, if they
      were used recently) once and written to the temporary copy, which will then reside in
      the cache, too. This increases the pressure in the memory hierarchy by the amount of
      memory used for the temporaries. Since temporaries are allocated on the stack
      memory, which is re-used frequently, the chances are good that the dirty memory area is
      redefined before it is written back to memory. Hence the write back operation to
      memory is of no concern.

   o Via an XPP configuration:

      The PAE array can read and write one value from every IRAM per cycle. Thus if half
      of the IRAMs are used as inputs and half of the IRAMs are used as outputs, up to
      eight (or more, depending on the number of IRAMs) values can be reordered per
      cycle, using the PAE array for address generation. As the inputs and outputs reside in
      IRAMs, it does not matter if the reordering is done before or after the configuration
      that uses the data - the IRAMs can be reused immediately.

IRAM Chaining

If the PAEs do not allow further unrolling, but there are still IRAMs left unused, it is possible to load
additional blocks of data into these IRAMs and chain two IRAMs by means of an address selector.
This does not increase throughput as much as unrolling would do, but it still helps to hide long
pipeline startup delays whenever unrolling is not possible.

2.6 Software / Hardware Interface

According to the design parameter changes and the corresponding changes to the hardware, the
hardware / software interface has changed. In the following the most prominent changes and their handling
will be discussed:

2.6.1 Explicit Cache

The proposed cache is not a usual cache, which would be - not considering performance issues -
invisible to the programmer / compiler, as its operation is transparent. The proposed cache is an explicit
cache. Its state has to be maintained by software.

Cache Consistency and Pipelining of Preload / Configuration / Write Back

The software is responsible for cache consistency. It is possible to have several IRAMs caching the
same, or overlapping memory areas. As long as only one of the IRAMs is written, this is perfectly ok:
Only this IRAM will be dirty and will be written back to memory. If however more than one of the
IRAMs is written, it is not defined which data will be written to memory. This is a software bug
(non-deterministic behavior).

As the execution of the configuration is overlapped with the preloads and write backs of the IRAMs, it
is possible to create preload / configuration sequences that contain data hazards. As the cache
controller and the XPP array can be seen as separate functional units, which are effectively pipelined,
these data hazards are equivalent to pipeline hazards of a normal instruction pipeline. As with any
ordinary pipeline, there are two possibilities to resolve this:

• Hardware interlocking:

  Interlocking is done by the cache controller. If the cache controller detects that the tag of a
  dirty or in-use item in IRAMx overlaps a memory area used for another IRAM preload, it has
  to stall that preload, effectively serializing the execution of the current configuration and the
  preload.

• Software interlocking:

  If the cache controller does not enforce interlocking, the code generator has to insert explicit
  synchronize instructions to take care of potential interlocks. Inter-procedural and
  inter-modular alias and data dependency analyses can determine if this is the case, while
  scheduling algorithms help to alleviate the impact of the necessary synchronization instructions.

In either case, as well as in the case of pipeline stalls due to cache misses, SMT can use the
computation power that would be wasted otherwise.

Code Generation for the Explicit Cache

Apart from the explicit synchronization instructions issued with software interlocking, the following
instructions have to be issued by the compiler.

• Configuration preload instructions, preceding the IRAM preload instructions that will be used
  by that configuration. These should be scheduled as early as possible by the instruction
  scheduler.

• IRAM preload instructions, which should also be scheduled as early as possible by the
  instruction scheduler.

• Configuration execute instructions, following the IRAM preload instructions for that
  configuration. These instructions should be scheduled between the estimated minimum and the
  estimated maximum of the cumulative latency of their preload instructions.

• IRAM synchronization instructions, which should be scheduled as late as possible by the
  instruction scheduler. These instructions must be inserted before any potential access of the
  RISC to the data areas that are duplicated and potentially modified in the IRAMs. Typically
  these instructions will follow a long chain of computations on the XPP, so they will not
  significantly decrease performance.
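
The ordering constraints listed above can be made concrete with a small C sketch. All functions below are hypothetical stand-ins that merely log the order of issue; they are not the real instructions, and the parameter lists are assumptions modelled on the usage example of Figure 2.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the XPP instructions: they only record the issue order. */
typedef enum { OP_PRELOAD_CONFIG, OP_PRELOAD, OP_PRELOAD_CLEAN, OP_EXECUTE, OP_SYNC } Op;

#define LOG_MAX 32
static Op  issue_log[LOG_MAX];
static int issue_count;

static void issue(Op op)
{
    assert(issue_count < LOG_MAX);
    issue_log[issue_count++] = op;
}

void XppPreloadConfig(const void *cfg) { (void)cfg; issue(OP_PRELOAD_CONFIG); }
void XppPreload(int iram, const void *addr, int size)
{ (void)iram; (void)addr; (void)size; issue(OP_PRELOAD); }
void XppPreloadClean(int iram, void *addr, int size)
{ (void)iram; (void)addr; (void)size; issue(OP_PRELOAD_CLEAN); }
void XppExecute(void) { issue(OP_EXECUTE); }
void XppSync(const void *addr, int size) { (void)addr; (void)size; issue(OP_SYNC); }

/* Order a code generator might emit for one configuration that reads a and b
 * and writes c (an assumed example, not taken from the study). */
void emit_one_configuration(const void *cfg, const int *a, const int *b, int *c, int n)
{
    XppPreloadConfig(cfg);       /* as early as possible, before its IRAM preloads */
    XppPreload(0, a, n);         /* input IRAMs, also as early as possible */
    XppPreload(1, b, n);
    XppPreloadClean(2, c, n);    /* IRAM assumed to hold the output */
    XppExecute();                /* after the preloads of this configuration */
    /* ... unrelated RISC work would be scheduled here ... */
    XppSync(c, n);               /* as late as possible, but before the RISC reads c */
}
```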
Asynchronicity to Other Functional Units

An XppSync() must be issued by the compiler if an instruction of another functional unit (mainly the
Ld/St unit) can access a memory area that is potentially dirty or in-use in an IRAM. This forces a
synchronization of the instruction streams and the cache contents, avoiding data hazards. A thorough
inter-procedural and inter-modular array alias analysis limits the frequency of these synchronization
instructions to an acceptable level.



2.7 Another Implementation

For the previous design, the IRAMs exist in silicon, duplicated several times to keep the pipeline
busy. This amounts to a large silicon area that is not fully busy all the time, especially when the
PAE array is not used, but also whenever the configuration does not use all of the IRAMs present in
the array. The duplication also makes it difficult to extend the lengths of the IRAMs, as the total size
of the already large IRAM area scales linearly.

For a more silicon efficient implementation, we should integrate the IRAMs into the first level cache,
making this cache bigger. This means that we have to extend the first level cache controller to feed all
IRAM ports of the PAE array. This way the XPP and the RISC will share the first level cache in a
more efficient manner. Whenever the XPP is executing, it will steal as much cache space as it needs
from the RISC. Whenever the RISC alone is running, it will have plenty of additional cache space to
improve performance.

The PAE array has the ability to read one word and write one word to each IRAM port every cycle.
This can be limited to either a read or a write access per cycle, without limiting programmability: If
data has to be written to the same area in the same cycle, another IRAM port can be used. This
increases the number of used IRAM ports, but only under rare circumstances.

This leaves sixteen data accesses per PAE cycle in the worst case. Due to the worst case of all sixteen
memory areas for the sixteen IRAM ports mapping to the same associative bank, the minimum
associativity for the cache is 16-way set associativity. This avoids cache replacement for this rare, but
possible worst-case example.

Two factors help to support sixteen accesses per PAE array cycle:

• The clock frequency of the PAE array generally has to be lower than for the RISC by a factor
  of two to four. The reasons lie in the configurable routing channels with switch matrices which
  cannot support as high a frequency as solid point-to-point aluminium or copper traces.

  This means that two to four IRAM port accesses can be handled serially by a single cache port,
  as long as all reads are serviced before all writes, if there is a potential overlap. This can be
  accomplished by assuming a potential overlap and enforcing a priority ordering of all accesses,
  giving the read accesses higher priority.

• A factor of two, four or eight is possible by accessing the cache as two, four or eight banks of
  lower associativity cache.

  For a cycle divisor of four, four banks of four-way associativity will be optimal. During four
  successive cycles, four different accesses can be served by each bank of four-way
  associativity. Up to four-way data duplication can be handled by using adjacent IRAM ports that are
  connected to the same bus (bank). For further data duplication, the data has to be duplicated
  explicitly, using an XppPreloadMultiple() cache controller instruction. The maximum data
  duplication for sixteen read accesses to the same memory area is supported by an actual data
  duplication factor of four: one copy in each bank. This does not affect the RAM efficiency as
  adversely as an actual data duplication of 16 for the design proposed in section 2.5.




Figure 6: Cache structure example 



The cache controller is running at the same speed as the RISC. The XPP is running at a lower (e.g.
quarter) speed. This way the worst case of sixteen read requests from the PAE array needs to be
serviced in four cycles of the cache controller, with an additional four read requests from the RISC. So one
bus at full speed can be used to service four IRAM read ports. Using four-way associativity, four
accesses per cycle can be serviced, even in the case that all four accesses go to addresses that map to the
same associative block.

a) The RISC still has a 16-way set associative view of the cache, accessing all four four-way set
   associative banks in parallel. Due to data duplication it is possible that several banks return a
   hit. This has to be taken care of with a priority encoder, enabling only one bank onto the data
   bus.

b) The RISC is blocked from the banks that service IRAM port accesses. Wait states are inserted
   accordingly.

c) The RISC shares the second cache access port of a two-port cache with the RAM interface,
   using the cycles between the RAM transfers for its accesses.

d) The cache is extended by a fifth 4-way set associative bank, used exclusively by the RISC.
   (The other banks are only accessed when they are not used by the current XPP configuration.
   PROBLEM: dirty line in a blocked bank)

-> 2 port RAM, concurrent reads ok, concurrent R/W to same cache line avoided by SW
synchronization / HW arbiter

Another problem is that a read could potentially address the same memory location as a write; the
value read depends on the order of the operation, so the order is fixed: all writes have to take place
after all reads, but before the reads of the next cycle, except if the reads and writes actually do not
overlap. This can only be a problem with data duplication, when only one copy of the data is actually
modified. Therefore modifications are forbidden with data duplication.



2.7.1 Programming Model Changes 

Data Interference 

With this design without dedicated IRAMs, it is not possible any more to load input data to the IRAMs
and write the output data to a different IRAM, which is mapped to the same address, thus operating on
the original, unaltered input data during the whole configuration.

As there are no dedicated IRAMs any more, writes directly modify the cache contents, which will be
read by succeeding reads. This changes the programming model significantly. Additional and more
in-depth compiler analyses are necessary accordingly.



2.7.2 Hiding Implementation Details

The actual number of bits in the destination field of the XppPreloadMultiple instruction is
implementation dependent. It depends on the number of cache banks and their associativity, which are determined
by the clock frequency divisor of the XPP PAE array relative to the cache frequency.
However, this can be hidden by the assembler, which translates IRAM ports to cache banks, thus
reducing the number of bits from the number of IRAM ports to the number of banks. For the user it is
sufficient to know that each cache bank services an adjacent set of IRAM ports starting at a power of
two. Thus it is best to use data duplication for adjacent ports, starting with the highest power of two
bigger than the number of read ports to the duplicated area.
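
The translation performed by the assembler might look like the following sketch. It assumes sixteen IRAM ports and that bank k serves the adjacent ports k * ports_per_bank up to (k + 1) * ports_per_bank - 1; the function name ports_to_banks is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Translate a bitmap of IRAM ports into a bitmap of cache banks, assuming each
 * bank serves `ports_per_bank` adjacent ports and that this number is a power of two. */
uint16_t ports_to_banks(uint16_t port_mask, unsigned ports_per_bank)
{
    assert(ports_per_bank != 0 && ports_per_bank <= 16);
    assert((ports_per_bank & (ports_per_bank - 1u)) == 0);   /* power of two */
    uint16_t bank_mask = 0;
    for (unsigned port = 0; port < 16; ++port)
        if (port_mask & (1u << port))
            bank_mask |= (uint16_t)(1u << (port / ports_per_bank));
    return bank_mask;
}
```

With four ports per bank, the sixteen-bit port bitmap shrinks to a four-bit bank bitmap: ports 0 to 3 all map to bank 0, so duplicating data across them needs only one copy in that bank.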
3 Program Optimizations

3.1 Code Analysis

In this section we describe the analyses that can be performed on programs. These analyses are then
used by different optimizations. They describe the relationships between data and memory locations in
the program. More details can be found in several books [2,3,5].

3.1.1 Data-Flow Analysis

Data-flow analysis examines the flow of scalar values through a program, to provide information
about how the program manipulates its data. This information can be represented by dataflow
equations that have the following general form for object i, that can be an instruction or a basic block,
depending on the problem to solve:

$Ex[i] = Prod[i] \cup (In[i] \setminus Supp[i])$

It means that data available at the end of the execution of object i, Ex[i], are either produced by i,
Prod[i], or were alive at the beginning of i, In[i], but were not deleted during the execution of i,
Supp[i].

These equations can be used to solve several problems like:

■ the problem of reaching definitions,

■ the Def-Use and Use-Def chains, describing respectively for a definition all uses that can be
  reached from it, and for a use all definitions that can reach it,

■ the available expressions at a point in the program,

■ the live variables at a point in the program,

whose solutions are then used by several compilation phases, analyses, or optimizations.

As an example let us take the problem of computing the Def-Use chains of the variables of a program.
This information can be used for instance by the data dependence analysis for scalar variables or by
the register allocation. A Def-Use chain is associated to each definition of a variable and is the set of
all visible uses from this definition. The data-flow equations presented above are applied to the basic
blocks to detect the variables that are passed from one block to another along the control-flow graph.
In the figure below, two definitions for variable x are produced: S1 in B1 and S4 in B3. Hence the
variable that can be found at the exit of B1 is Ex(B1)={x(S1)}, and at the exit of B3 is Ex(B3)={x(S4)}.
Moreover we have Ex(B2)=Ex(B1) as no variable is defined in B2. Using these sets, we find that the
uses of x in S2 and S3 depend on the definition of x in B1, and that the use of x in S5 depends on the
definitions of x in B1 and B3. The Def-Use chains associated with the definitions are then

D(S1) = {S2, S3, S5} and D(S4) = {S5}.
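
The computation of these chains can be reproduced with a short iterative data-flow solver in C. This is a sketch under assumptions: the edges of the control-flow graph (B1->B2, B2->B3, B2->B4, B3->B4) and the placement of S2 and S3 in B2 are assumed, as they cannot be read from the text alone, and all identifiers are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum { NBLOCKS = 4, NSTMTS = 5 };
#define BIT(s) (1u << ((s) - 1))   /* bit that represents statement S<s> */

/* Definitions of x produced per block (B1: S1, B3: S4) and uses of x per block. */
static const unsigned prod[NBLOCKS] = { BIT(1), 0, BIT(4), 0 };
static const unsigned uses[NBLOCKS] = { 0, BIT(2) | BIT(3), 0, BIT(5) };
/* Predecessor bitmasks; ASSUMED edges: B1->B2, B2->B3, B2->B4, B3->B4. */
static const unsigned pred[NBLOCKS] = { 0, 1u << 0, 1u << 1, (1u << 1) | (1u << 2) };

/* chain[d] receives the set of uses reached by the definition in statement S<d+1>. */
void def_use_chains(unsigned chain[NSTMTS])
{
    unsigned in[NBLOCKS] = {0}, ex[NBLOCKS] = {0};
    bool changed = true;
    while (changed) {                    /* iterate until a fixed point is reached */
        changed = false;
        for (int b = 0; b < NBLOCKS; ++b) {
            unsigned i = 0;
            for (int p = 0; p < NBLOCKS; ++p)
                if (pred[b] & (1u << p))
                    i |= ex[p];          /* In[b] = union of Ex over the predecessors */
            /* A new definition of x deletes every incoming definition of x. */
            unsigned killed = prod[b] ? ~0u : 0u;
            unsigned e = prod[b] | (i & ~killed);
            if (i != in[b] || e != ex[b]) {
                in[b] = i;
                ex[b] = e;
                changed = true;
            }
        }
    }
    for (int d = 0; d < NSTMTS; ++d)
        chain[d] = 0;
    /* No block here both defines and uses x, so the uses of a block see In[b]. */
    for (int b = 0; b < NBLOCKS; ++b)
        for (int d = 0; d < NSTMTS; ++d)
            if (in[b] & (1u << d))
                chain[d] |= uses[b];
}
```

Under the assumed edges, the solver yields the uses S2, S3 and S5 for the definition S1 and the single use S5 for the definition S4.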






[Control-flow graph with basic blocks B1 to B4; B1 contains S1: x = ... and B4 contains S5: ... = x]

Figure 7: Control-flow graph of a piece of program



3.1.2 Data Dependence Analysis

A data dependence graph represents the dependences existing between operations writing or reading
the same data. This graph is used for optimizations like scheduling, or certain loop optimizations to
test their semantic validity. The nodes of the graph represent the instructions, and the edges represent
the data dependences. These dependences can be of three types: true (or flow) dependence when a
variable is written before being read, anti-dependence when a variable is read before being written,
and output dependence when a variable is written twice. Here is a more formal definition [3].

Definition

Let S and S' be 2 statements, then S' depends on S, noted $S \,\delta\, S'$, iff:

(1) S is executed before S'

(2) $\exists v \in VAR : v \in DEF(S) \cap USE(S') \lor v \in USE(S) \cap DEF(S') \lor v \in DEF(S) \cap DEF(S')$

(3) There is no statement T such that S is executed before T and T is executed before S', and
    $v \in DEF(T)$

Where VAR is the set of the variables of the program, DEF(S) is the set of the variables defined by
instruction S, and USE(S) is the set of variables used by instruction S.

Moreover, if the statements are in a loop, a dependence can be loop-independent or loop-carried. This
notion introduces the definition of the distance of a dependence. When a dependence is loop-independent,
it occurs between two instances of different statements in the same iteration,
and its distance is equal to 0. On the contrary, when a dependence occurs between two instances in
two different iterations, the dependence is loop-carried, and the distance is equal to the difference
between the iteration numbers of the two instances.

The notion of direction of dependence generalizes the notion of distance, and is generally used when
the distance of a dependence is not constant, or cannot be computed with precision. The direction of a
dependence is given by < if the dependence between S and S' occurs when the instance of S is in an
iteration before the iteration of the instance of S', = if the two instances are in the same iteration, and >
if the instance of S is in an iteration after the iteration of the instance of S'.

In the case of a loop nest, we then have distance and direction vectors, with one element for each level
of the loop nest. The figures below illustrate all these definitions. The data dependence graph is used
by a lot of optimizations, and is also useful to determine if their application is valid. For instance, a
loop can be vectorized if its data dependence graph does not contain any cycle.



for(i=0; i<N; i=i+1) {
S:    a[i] = b[i] + 1;
S1:   c[i] = a[i] + 2;
}

Figure 8: Example of a true dependence with distance 0 on array a

for(i=0; i<N; i=i+1) {
S:    a[i] = b[i] + 1;
S1:   b[i] = c[i] + 2;
}

Figure 9: Example of an anti-dependence with distance 0 on array b

for(i=0; i<N; i=i+1) {
S:    a[i] = b[i] + 1;
S1:   a[i] = c[i] + 2;
}

Figure 10: Example of an output dependence with distance 0 on array a

for(j=0; j<=N; j++)
  for(i=0; i<=N; i++)
  {
S1:   c[i][j] = 0;
      for(k=0; k<=N; k++)
S2:     c[i][j] = c[i][j] + a[i][k]*b[k][j];
  }

Figure 11: Example of a dependence with direction vector (=,=) between
S1 and S2 and a dependence with direction vector (=,=,<) between
S2 and S2.



for(i=0; i<=N; i++)
  for(j=0; j<=N; j++)
S:  a[i][j] = a[i][j+2] + b[i];

Figure 12: Example of an anti-dependence with distance vector (0,2).
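The distance and direction definitions above can be made concrete for the simplest case. The following is a minimal sketch, assuming a single statement that writes a[i + cw] and reads a[i + cr] inside a loop whose index i increases by 1; the names `classify`, `dependence` and `dep_kind` are illustrative and not part of the original text. Both accesses touch the same element when the read iteration minus the write iteration equals cw - cr, which gives the kind and the distance directly.

```c
#include <assert.h>

/* Kinds of dependence between a write a[i + cw] and a read a[i + cr]
 * that sit in the same statement of a loop "for (i = 0; i < N; i++)". */
typedef enum { DEP_TRUE, DEP_ANTI, DEP_LOOP_INDEPENDENT } dep_kind;

typedef struct {
    dep_kind kind;
    int distance;    /* difference of the iteration numbers, never negative */
    char direction;  /* '<' for a loop-carried dependence, '=' otherwise */
} dependence;

dependence classify(int cw, int cr)
{
    dependence d;
    int diff = cw - cr;  /* read iteration minus write iteration */

    if (diff > 0) {
        /* the element is written first and read diff iterations later */
        d.kind = DEP_TRUE;
        d.distance = diff;
        d.direction = '<';
    } else if (diff < 0) {
        /* the element is read first and overwritten -diff iterations later */
        d.kind = DEP_ANTI;
        d.distance = -diff;
        d.direction = '<';
    } else {
        /* both accesses happen in the same iteration */
        d.kind = DEP_LOOP_INDEPENDENT;
        d.distance = 0;
        d.direction = '=';
    }
    assert(d.distance >= 0);
    return d;
}
```

Applied to the inner loop of Figure 12 (write a[i][j], read a[i][j+2]), the call classify(0, 2) reports an anti-dependence with distance 2, in line with the second component of the distance vector given in the caption.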



3.1.3 Interprocedural Alias Analysis 

The aim of alias analysis is to determine if a memory location is aliased by several objects, like variables
or arrays, in a program. It has a strong impact on data dependence analysis and on the application
of code optimizations. Aliases can occur with statically allocated data, like unions in C where all fields
refer to the same memory area, or with dynamically allocated data, which are the usual targets of the
analysis. In Figure 13, we have a typical case of aliasing where p alias b.

int b[100], *p;
for (p=b; p < &b[100]; p++)
  *p = 0;

Figure 13: Example for typical aliasing

Alias analysis can be more or less precise depending on whether or not it takes the control-flow into
account. When it does, it is called flow-sensitive, and when it does not, it is called flow-insensitive.
Flow-sensitive alias analysis is able to detect in which blocks along a path two objects are aliased. As
it is more precise, it is more complicated and more expensive to compute. Usually flow-insensitive
alias information is sufficient. This aspect is illustrated in Figure 14, where a flow-insensitive analysis
would find that p alias b, but where a flow-sensitive analysis would be able to find that p alias b only
in block B2.

Furthermore, aliases are classified into must-aliases and may-aliases. For instance, if we consider
flow-insensitive may-alias information, then x alias y iff x and y may, possibly at different times, refer to
the same memory location. And if we consider flow-insensitive must-alias information, x alias y iff x
and y must, throughout the execution of a procedure, refer to the same storage location. In the case of
Figure 14, if we consider flow-insensitive may-alias information, p alias b holds, whereas if we consider
flow-insensitive must-alias information, p alias b does not hold. The kind of information to use
depends on the problem to solve. For instance, if we want to remove redundant expressions or statements,
must-aliases must be used, whereas if we want to build a data dependence graph, may-aliases
are necessary.



B1:  int *p, b[100];

B2:  p = b;
     <uses of b and p>
     p = malloc();

B3:  p = malloc();
     <uses of b and p>

B4:  <uses of b and p>

Figure 14: Example of control-flow sensitivity



Finally, this analysis must be interprocedural to be able to detect aliases caused by non-local variables
and parameter passing. The latter case is depicted in Figure 15, where i and j are aliased through the
function call where k is passed twice as parameter.

void foo(int *i, int *j)
{
  *i = *j + 1;
}

foo(&k, &k);

Figure 15: Example for aliasing by parameter passing
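To see why may-alias information matters for optimization, consider the following minimal sketch. The helper `add_twice` is hypothetical and only serves as an illustration: when both pointers refer to the same object, the first store changes the value read by the second statement, so a compiler may keep *j in a register across the store only if it can show that i and j do not alias.

```c
#include <assert.h>

/* Hypothetical helper: adds *j to *i twice. */
void add_twice(int *i, const int *j)
{
    assert(i != 0 && j != 0);
    *i = *i + *j;  /* if i and j alias, this store also changes *j ... */
    *i = *i + *j;  /* ... so *j has to be read again here */
}
```

With distinct objects the second argument is added twice to the first. With a call of the form add_twice(&k, &k), as in the parameter-passing case of Figure 15, the value of k is doubled twice instead.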



3.1.4 Interprocedural Value Range Analysis

This analysis can find the range of values taken by the variables. It can help to apply optimizations like
dead code elimination, loop unrolling and others. For this purpose it can use information on the types
of variables and then consider operations applied on these variables during the execution of the program.
Thus it can determine, for instance, if tests in conditional instructions are likely to be met or not,
or determine the iteration range of loop nests.

This analysis has to be interprocedural as, for instance, loop bounds can be passed as parameters of a
function, like in the following example. We know by analyzing the code that in the loop executed with
array a, N is at least equal to 11, and that in the loop executed with array b, N is at most equal to 10.

void foo(int *c, int N)
{
  int i;

  for (i=0; i<N; i++)
    c[i] = g(i,2);
}

if (N > 10)
  foo(a,N);
else
  foo(b,N);

The value range analysis can be supported by the programmer by giving further value constraints
which cannot be retrieved from the language semantics. This can be done by pragmas or a compiler
known assert function.
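As a minimal sketch of such a programmer-supplied constraint, the standard C assert macro is used below in place of a compiler known assert function. The function `sum_first` and the bound `MAX_N` are assumptions made for this illustration only. A value range analysis that understands the assertion could conclude that the loop runs between 1 and MAX_N times.

```c
#include <assert.h>

enum { MAX_N = 16 };  /* assumed upper bound, for illustration only */

/* Hypothetical kernel: sums the first n elements of c. */
int sum_first(const int *c, int n)
{
    /* value constraint that cannot be derived from the language semantics */
    assert(n >= 1 && n <= MAX_N);

    int s = 0;
    for (int i = 0; i < n; i++)  /* iteration range is now known: 1..MAX_N */
        s += c[i];
    return s;
}
```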

3.1.5 Alignment Analysis 

Alignment analysis deals with data layout for distributed memory architectures. As stated by Saman
Amarasinghe: "Although data memory is logically a linear array of cells, its realization in hardware
can be viewed as a multi-dimensional array. Given a dimension in this array, alignment analysis will
identify memory locations that always resolve to a single value in that dimension. For example, if the
dimension of interest is memory banks, alignment analysis will identify if a memory reference always
accesses the same bank". This is the case in the second part of the figure below, that can be found in
[10], where all accesses, depicted in blue, occur to the same memory bank, whereas in the first part,
the accesses are not aligned. He then adds that: "Alignment information is useful in a variety of
compiler-controlled memory optimizations leading to improvements in programmability, performance, and
energy consumption."

Alignment analysis, for instance, is able to help find a good distribution scheme of the data and is
furthermore useful for automatic data distribution tools. An automatic alignment analysis tool can
automatically generate alignment proposals for the arrays accessed in a procedure and thus simplifies
the data distribution problem. This can be extended with an interprocedural analysis taking into
account dynamic realignment.

Alignment analysis can also be used to apply loop alignment that transforms the code directly rather
than the data layout in itself, as shown later. Another solution can be used for the PACT XPP, relying
on the fact that it can handle aligned code very efficiently. It consists in adding a conditional instruction
testing if the accesses in the loop body are aligned, followed by the necessary number of peeled
iterations of the loop body, then the aligned loop body, and then some compensation code. Only the
aligned code is then executed by the PACT XPP; the rest is executed by the host processor. If the
alignment analysis is more precise (inter-procedural or inter-modular), less conditional code has to be
inserted.
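The code shape described in the last paragraph (alignment test, peeled iterations, aligned body, compensation code) can be sketched as follows. This is an illustration under stated assumptions, not the code generated for the PACT XPP: the function name `add_const_aligned` and the block size `VEC` are hypothetical, and all three parts run on the host here, whereas the text assigns only the aligned body to the XPP.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VEC 4  /* assumed number of ints per aligned block */

/* Adds c to a[0..n-1] using peeled iterations, an aligned body and
 * compensation code. */
void add_const_aligned(int *a, size_t n, int c)
{
    size_t i = 0;
    assert(a != NULL);

    /* peeled iterations: run until &a[i] sits on a block boundary */
    while (i < n && ((uintptr_t)&a[i] % (VEC * sizeof(int))) != 0) {
        a[i] += c;
        i++;
    }

    /* aligned loop body: whole blocks only */
    for (; i + VEC <= n; i += VEC)
        for (size_t k = 0; k < VEC; k++)
            a[i + k] += c;

    /* compensation code: elements left after the last whole block */
    for (; i < n; i++)
        a[i] += c;
}
```

If the analysis can prove that the array start is already aligned, the first loop and its test can be dropped, which is the saving in conditional code mentioned above.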



3.2 Code Optimizations 

Most of the optimizations and transformations presented here can be found in detail in [4], and also in
[2,3,5].



3.2.1 General Transformations

We present in this section a few general optimizations that can be applied to straightforward code, and
to loop bodies. These are not the only ones that appear in a compiler, but they are mentioned in the
sequel of this document.



Constant Propagation 

This optimization propagates the values of constants into the expressions using them throughout the
program. This way a lot of computations can be done statically by the compiler, leaving less work to
be done during the execution. This part of the optimization is also known as constant folding.

N = 256;                                for(i=0; i<=256; i++)
c = 3;                                    a[i] = b[i] + 3;
for(i=0; i<=N; i++)
  a[i] = b[i] + c;

Figure 16: Example of constant propagation



Copy Propagation 

This optimization simplifies the code by removing redundant copies of the same variable in the code.
These copies can be produced by the programmer himself or by other optimizations. This optimization
reduces the register pressure and the number of register-to-register move instructions.

t = i*4;                                t = i*4;
r = t;                                  for(i=0; i<=N; i++)
for(i=0; i<=N; i++)                       a[t] = b[t] + a[i];
  a[r] = b[r] + a[i];

Figure 17: Example of copy propagation



Dead Code Elimination 

This optimization removes pieces of code that will never be executed. Code is never executed if it is in
the branch of a conditional statement whose condition is always evaluated to true or false, or if it is a
loop body whose number of iterations is always equal to 0.

Code updating variables that are never used is also useless and can be removed as well. If a variable
is never used, then the code updating it and its declaration can also be eliminated.

for(i=0; i<=N; i++) {                   for(i=0; i<=N; i++) {
  for(j=0; j<0; j++)                      for(j=0; j<10; j++)
    a[j] = b[j] + a[i];                     a[j+1] = a[j] + b[j];
  for(j=0; j<10; j++)                   }
    a[j+1] = a[j] + b[j];
}

Figure 18: Example of dead code elimination

Forward Substitution 

This optimization is a generalization of copy propagation. The use of a variable is replaced by its
defining expression. It can be used for simplifying the data dependency analysis and the application of
other transformations by making the use of loop variables visible.

c = N + 1;                              for(i=0; i<=N; i++)
for(i=0; i<=N; i++)                       a[N+1] = b[N+1] + a[i];
  a[c] = b[c] + a[i];

Figure 19: Example of forward substitution



Idiom Recognition 

This transformation recognizes pieces of code and can replace them by calls to compiler known functions,
or less expensive code sequences, like code for absolute value computation.

for(i=0; i<N; i++) {                    for(i=0; i<N; i++) {
  c = a[i] - b[i];                        c = a[i] - b[i];
  if (c<0)                                c = abs(c);
    c = -c;                               d[i] = c;
  d[i] = c;                             }
}

Figure 20: Example of idiom recognition



3.2.2 Loop Transformations 

Loop Normalization 

This transformation ensures that the iteration space of the loop always has a lower bound equal to 0
or 1 (depending on the input language), and a step of 1. The array subscript expressions and the
bounds of the loops are modified accordingly. It can be used before loop fusion to find opportunities
and ease inter-loop dependence analysis, and it also enables the use of dependence tests that need
normalized loops to be applied.

for(i=2; i<N; i=i+2)                    for(i=0; i<(N-2)/2; i++)
  a[i] = b[i];                            a[2*i+2] = b[2*i+2];



Figure 21: Example of loop normalization

Loop Reversal

This transformation changes the direction in which the iteration space of a loop is scanned. It is usually
used in conjunction with loop normalization and other transformations, like loop interchange, because
it changes the dependence vectors.

for(i=N; i>=0; i--)                     for(i=0; i<=N; i++)
  a[i] = b[i];                            a[i] = b[i];

Figure 22: Example of loop reversal



Strength Reduction 

This transformation replaces expressions in the loop body by equivalent but less expensive ones. It can
be used on induction variables, other than the loop variable, to be able to eliminate them.

for(i=0; i<N; i++)                      t = 0;
  a[i] = b[i] + c*i;                    for(i=0; i<N; i++) {
                                          a[i] = b[i] + t;
                                          t = t + c;
                                        }

Figure 23: Example of strength reduction

Induction Variable Elimination 

This transformation can use strength reduction to remove induction variables from a loop, hence
reducing the number of computations and easing the analysis of the loop. This also removes dependence
cycles due to the update of the variable, enabling vectorization.

for(i=0; i<=N; i++) {                   for(i=0; i<=N; i++) {
  k = k + 3;                              a[i] = b[i] + a[k+(i+1)*3];
  a[i] = b[i] + a[k];                   }
}
                                        k = k + (N+1)*3;

Figure 24: Example of induction variable elimination

Loop-Invariant Code Motion 

This transformation moves computations outside a loop if their result is the same in all iterations. This
reduces the number of computations in the loop body. This optimization can also be conducted
in the reverse fashion in order to get perfectly nested loops, which are easier to handle by other
optimizations.

for(i=0; i<N; i++)                      if (N >= 0)
  a[i] = b[i] + x*y;                      c = x*y;
                                        for(i=0; i<N; i++)
                                          a[i] = b[i] + c;



Figure 25: Example of loop-invariant code motion

Loop Unswitching

This transformation moves a conditional instruction outside of a loop body if its condition is
loop-invariant. The branches of the condition are then made of the original loop with the appropriate
original statements of the conditional statement. It allows further parallelization of the loop by removing
control-flow in the loop body and also removing unnecessary computations from it.

for(i=0; i<N; i++) {                    if (x > 2)
  a[i] = b[i] + 3;                        for(i=0; i<N; i++) {
  if (x > 2)                                a[i] = b[i] + 3;
    b[i] = c[i] + 2;                        b[i] = c[i] + 2;
  else                                    }
    b[i] = c[i] - 2;                    else
}                                         for(i=0; i<N; i++) {
                                            a[i] = b[i] + 3;
                                            b[i] = c[i] - 2;
                                          }

Figure 26: Example of loop unswitching



If-Conversion 



This transformation is applied on loop bodies with conditional instructions. It changes control
dependences into data dependences and then allows vectorization to take place. It can be used in conjunction
with loop unswitching to handle loop bodies with several basic blocks. The conditions, where array
expressions could appear, are replaced by boolean terms called guards. Processors with predicated
execution support can execute such code directly.

for(i=0; i<N; i++) {                    for(i=0; i<N; i++) {
  a[i] = a[i] + b[i];                     a[i] = a[i] + b[i];
  if (a[i] != 0)                          c2 = (a[i] != 0);
    if (a[i] > c[i])                      if (c2) c4 = (a[i] > c[i]);
      a[i] = a[i] - 2;                    if (c2 && c4) a[i] = a[i] - 2;
    else                                  if (c2 && !c4) a[i] = a[i] + 1;
      a[i] = a[i] + 1;                    d[i] = a[i] * 2;
  d[i] = a[i] * 2;                      }
}

Figure 27: Example of if-conversion

Strip-Mining 

This transformation makes it possible to adjust the granularity of an operation. It is commonly used to choose
the number of independent computations in the inner loop nest. When the iteration count is not known
at compile time, it can be used to generate a fixed iteration count inner loop satisfying the resource
constraints. It can be used in conjunction with other transformations like loop distribution or loop
interchange. It is also called loop sectioning. Cycle shrinking, also called stripping, is a specialization of
strip-mining.

for(i=0; i<N; i++)                      up = (N/16)*16;
  a[i] = b[i] + c;                      for(i=0; i<up; i=i+16)
                                          a[i:i+15] = b[i:i+15] + c;
                                        for(j=up; j<N; j++)
                                          a[j] = b[j] + c;



Figure 28: Example of strip-mining

Loop Tiling

This transformation modifies the iteration space of a loop nest by introducing loop levels to divide the
iteration space in tiles. It is a multi-dimensional generalization of strip-mining. It is generally used to
improve memory reuse, but can also improve processor, register, TLB, or page locality. It is also
called loop blocking.

The size of the tiles of the iteration space is chosen so that the data needed in each tile fit in the cache
memory, thus reducing the cache misses. In the case of coarse-grain computers, the size of the tiles can
also be chosen so that the number of parallel operations of the loop body fits the number of processors
of the computer.

for(i=0; i<N; i++)                      for(ii=0; ii<N; ii=ii+16)
  for(j=0; j<N; j++)                      for(jj=0; jj<N; jj=jj+16)
    a[i][j] = b[j][i];                      for(i=ii; i<min(ii+16,N); i++)
                                              for(j=jj; j<min(jj+16,N); j++)
                                                a[i][j] = b[j][i];

Figure 29: Example of loop tiling
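The tiled nest can be written out as compilable C. The sketch below makes some assumptions that are not in the original: the array size N, the tile size TILE and the helper `min_int` are chosen for illustration, and N is deliberately not a multiple of TILE so that the clamping of the tile bounds is exercised.

```c
#include <assert.h>

enum { N = 40, TILE = 16 };  /* sizes assumed for illustration */

static int min_int(int x, int y) { return x < y ? x : y; }

/* Tiled version of: for i, for j: a[i][j] = b[j][i]; */
void transpose_tiled(int a[N][N], int b[N][N])
{
    assert(a != b);  /* source and destination must be distinct arrays */

    for (int ii = 0; ii < N; ii += TILE)          /* tile origin, rows */
        for (int jj = 0; jj < N; jj += TILE)      /* tile origin, columns */
            /* iterate inside one TILE x TILE tile, clamped at the border */
            for (int i = ii; i < min_int(ii + TILE, N); i++)
                for (int j = jj; j < min_int(jj + TILE, N); j++)
                    a[i][j] = b[j][i];
}
```

Each tile touches at most TILE rows of a and TILE rows of b, so the tile size can be picked to match the cache capacity or, on a coarse-grain machine, the number of processing elements.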

Loop Interchange 

This transformation is applied to a loop nest to move inside or outside (depending on the searched
effect) the loop level containing data dependences. It can:

■ enable vectorization by moving inside an independent loop and outside a dependent loop, or

■ improve vectorization by moving inside the independent loop with the largest range, or

■ reduce the stride, or

■ increase the number of loop-invariant expressions in the inner loop, or

■ improve parallel performance by moving an independent loop outside of a loop nest to increase the
granularity of each iteration and reduce the number of barrier synchronizations.

for(i=0; i<N; i++)                      for(j=0; j<N; j++)
  for(j=0; j<N; j++)                      for(i=0; i<N; i++)
    a[i] = a[i] + b[i][j];                  a[i] = a[i] + b[i][j];

Figure 30: Example of loop interchange



Loop Coalescing / Collapsing 

This transformation combines a loop nest into a single loop. It can improve the scheduling of the loop,
and also reduces the loop overhead. Collapsing is a simpler version of coalescing in which the number
of dimensions of arrays is reduced as well. Collapsing reduces the overhead of nested loops and
multi-dimensional arrays. Collapsing can be applied to loop nests that iterate over memory with a constant
stride; otherwise loop coalescing is a better approach. It can be used to make vectorizing profitable by
increasing the iteration range of the innermost loop.

for(i=0; i<N; i++)                      for(k=0; k<N*M; k++) {
  for(j=0; j<M; j++)                      i = k / M;
    a[i][j] = a[i][j] + c;                j = k % M;
                                          a[i][j] = a[i][j] + c;
                                        }

} 



Figure 31: Example of loop coalescing

Loop Fusion

This transformation, also called loop jamming, merges 2 successive loops. It reduces loop overhead,
increases instruction-level parallelism, improves register, cache, TLB or page locality, and improves
the load balance of parallel loops. Alignment can be taken into account by introducing conditional
instructions to take care of dependences.

for(i=0; i<N; i++)                      for(i=0; i<N; i++) {
  a[i] = b[i] + c;                        a[i] = b[i] + c;
for(i=0; i<N; i++)                        d[i] = e[i] + c;
  d[i] = e[i] + c;                      }

Figure 32: Example of loop fusion



Loop Distribution 

This transformation, also called loop fission, splits a loop into several pieces in case the loop
body is too big, or because of dependences. The iteration space of the new loops is the same as the
iteration space of the original loop. Loop spreading is a more sophisticated distribution.

for(i=0; i<N; i++) {                    for(i=0; i<N; i++)
  a[i] = b[i] + c;                        a[i] = b[i] + c;
  d[i] = e[i] + c;                      for(i=0; i<N; i++)
}                                         d[i] = e[i] + c;

Figure 33: Example of loop distribution



Loop Unrolling / Unroll-and-Jam

This transformation replicates the original loop body in order to get a larger one. A loop can be
unrolled partially or completely. It is used to get more opportunity for parallelization by making the loop
body bigger; it also improves register or cache usage and reduces loop overhead. Unrolling the
outer loop followed by merging the induced inner loops is referred to as unroll-and-jam.

for(i=0; i<N; i++)                      for(i=0; i<N; i=i+2) {
  a[i] = b[i] + c;                        a[i] = b[i] + c;
                                          a[i+1] = b[i+1] + c;
                                        }
                                        if (((N-1)%2) == 1)
                                          a[N-1] = b[N-1] + c;

Figure 34: Example of loop unrolling
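When the iteration count is not known to be even, the unrolled loop needs a remainder step that keeps all accesses in bounds. The following is one possible sketch of a two-way unrolled loop with such a step. The function name `add_const_unrolled` and the choice of loop bound and remainder test are assumptions of this example and are written independently of the figure.

```c
#include <assert.h>
#include <stddef.h>

/* Two-way unrolled version of: for (i = 0; i < n; i++) a[i] = b[i] + c; */
void add_const_unrolled(int *a, const int *b, size_t n, int c)
{
    size_t i;
    assert(a != NULL && b != NULL);

    /* unrolled body: runs only while two elements are still available */
    for (i = 0; i + 1 < n; i += 2) {
        a[i]     = b[i]     + c;
        a[i + 1] = b[i + 1] + c;
    }

    /* remainder: exactly one element is left when n is odd */
    if (i < n)
        a[i] = b[i] + c;
}
```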



Loop Alignment

This optimization transforms the code to get aligned array accesses in the loop body. Its effect is to
transform loop-carried dependences into loop-independent dependences, which makes it possible to extract more
parallelism from a loop. It can use different transformations, like loop peeling or introducing conditional
statements, to achieve its goal. This transformation can be used in conjunction with loop fusion to
enable this optimization by aligning the array accesses in both loop nests. In the example below, all
accesses to array a become aligned.

for(i=2; i<=N; i++) {                   for(i=1; i<=N; i++) {
  a[i] = b[i] + c[i];                     if (i>1) a[i] = b[i] + c[i];
  d[i] = a[i-1] * 2;                      if (i<N) d[i+1] = a[i] * 2;
  e[i] = a[i-1] + d[i+1];                 if (i<N) e[i+1] = a[i] + d[i+2];
}                                       }

Figure 35: Example of loop alignment



Loop Skewing 

This transformation is used to enable parallelization of a loop nest. It is useful in combination with
loop interchange. It is performed by adding the outer loop index multiplied by a skew factor f to the
bounds of the inner loop variable, and then subtracting the same quantity from every use of the inner
loop variable inside the loop.

for(i=1; i<=N; i++)                     for(i=1; i<=N; i++)
  for(j=1; j<=N; j++)                     for(j=i+1; j<=i+N; j++)
    a[i] = a[i+j] + c;                      a[i] = a[j] + c;

Figure 36: Example of loop skewing
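The two nests of the figure can be compared directly. The sketch below assumes a skew factor of 1 and an array with at least 2*n + 1 elements. The function names `nest_original` and `nest_skewed` are illustrative. Adding i to both bounds of the inner loop and subtracting i from every use of j turns the subscript i + j into plain j, without changing the order in which elements are read and written.

```c
#include <assert.h>

/* Original nest: a[i] = a[i + j] + c for i, j in 1..n. */
void nest_original(int *a, int n, int c)
{
    assert(n >= 0);
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= n; j++)
            a[i] = a[i + j] + c;
}

/* Skewed nest, skew factor 1: inner bounds shifted by i. */
void nest_skewed(int *a, int n, int c)
{
    assert(n >= 0);
    for (int i = 1; i <= n; i++)
        for (int j = i + 1; j <= i + n; j++)
            a[i] = a[j] + c;  /* a[i + (j - i)] in terms of the new j */
}
```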

Loop Peeling 

This transformation removes a small number of beginning or ending iterations of a loop to avoid
dependences in the loop body. These removed iterations are executed separately. It can be used for
matching the iteration control of adjacent loops to enable loop fusion.

for(i=0; i<=N; i++)                     a[0][N] = a[0][N] + a[N][N];
  a[i][N] = a[0][N] + a[N][N];          for(i=1; i<=N-1; i++)
                                          a[i][N] = a[0][N] + a[N][N];
                                        a[N][N] = a[0][N] + a[N][N];

Figure 37: Example of loop peeling

Loop Splitting 

This transformation cuts the iteration space in pieces by creating other loop nests. It is also called
Index Set Splitting, and is generally used because of dependences that prevent parallelization. The
iteration space of the new loops is a subset of the original one. It can be seen as a generalization of loop
peeling.

for(i=0; i<=N; i++)                     for(i=0; i<(N+1)/2; i++)
  a[i] = a[N-i+1] + c;                    a[i] = a[N-i+1] + c;
                                        for(i=(N+1)/2; i<=N; i++)
                                          a[i] = a[N-i+1] + c;

Figure 38: Example of loop splitting

Node Splitting 

This transformation splits a statement in pieces. It is used to break dependence cycles in the dependence
graph due to the too high granularity of the nodes, thus enabling vectorization of the statements.

for(i=0; i<N; i++) {                    for(i=0; i<N; i++) {
  b[i] = a[i] + c[i] * d[i];              t1[i] = c[i] * d[i];
  a[i+1] = b[i] * (d[i] - c[i]);          t2[i] = d[i] - c[i];
}                                         b[i] = a[i] + t1[i];
                                          a[i+1] = b[i] * t2[i];
                                        }

Figure 39: Example of node splitting



Scalar Expansion 



This transformation replaces a scalar in a loop by an array to eliminate dependences in the loop body
and enable parallelization of the loop nest. If the scalar is used after the loop, compensation code must
be added.

for(i=0; i<N; i++) {                    for(i=0; i<N; i++) {
  c = b[i];                               tmp[i] = b[i];
  a[i] = a[i] + c;                        a[i] = a[i] + tmp[i];
}                                       }
                                        c = tmp[N-1];

Figure 40: Example of scalar expansion

Array Contraction / Array Shrinking

This transformation is the reverse transformation of scalar expansion. It may be needed if scalar
expansion generates too many memory requirements.

for(i=0; i<N; i++)                      for(i=0; i<N; i++)
  for(j=0; j<N; j++) {                    for(j=0; j<N; j++) {
    t[i][j] = a[i][j] * 3;                  t[j] = a[i][j] * 3;
    b[i][j] = t[i][j] + c[j];               b[i][j] = t[j] + c[j];
  }                                       }

Figure 41: Example of array contraction



Scalar Replacement 



This transformation replaces an invariant array reference in a loop by a scalar. This array element is
loaded in a scalar before the inner loop and stored again after the inner loop, if it is modified. It can be
used in conjunction with loop interchange.

for(i=0; i<N; i++)                      for(i=0; i<N; i++) {
  for(j=0; j<N; j++)                      tmp = a[i];
    a[i] = a[i] + b[i][j];                for(j=0; j<N; j++)
                                            tmp = tmp + b[i][j];
                                          a[i] = tmp;
                                        }

Figure 42: Example of scalar replacement

Reduction Recognition 

This transformation handles reductions in loops. A reduction is an operation that computes a
scalar value from arrays. It can be a dot product, or the sum or minimum of a vector, for instance. The
goal is then to perform as many operations in parallel as possible. One way is to accumulate a vector
register of partial results and then reduce it to a scalar with a sequential loop. Maximum parallelism is
then achieved by reducing the vector register with a tree: pairs of elements are summed, then pairs of
these results are summed, etc.

for(i=0; i<N; i++)                      for(i=0; i<N; i=i+64)
  s = s + a[i];                           tmp[0:63] = tmp[0:63] + a[i:i+63];
                                        for(i=0; i<64; i++)
                                          s = s + tmp[i];

Figure 43: Example of reduction recognition
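The tree-shaped final reduction mentioned in the text can be sketched in plain C. The example below makes a few assumptions of its own: the vector register is modelled as an array of LANES partial sums (8 here instead of the 64 of the figure, and a power of two), leftover elements are folded into the first partial sum, and the function name `sum_reduce` is illustrative.

```c
#include <assert.h>
#include <stddef.h>

enum { LANES = 8 };  /* assumed vector width; must be a power of two */

/* Sums a[0..n-1] with LANES partial sums and a tree-shaped final step. */
int sum_reduce(const int *a, size_t n)
{
    int tmp[LANES] = {0};
    size_t i;
    assert((LANES & (LANES - 1)) == 0);

    /* accumulate LANES independent partial sums (vectorizable part) */
    for (i = 0; i + LANES <= n; i += LANES)
        for (size_t l = 0; l < LANES; l++)
            tmp[l] += a[i + l];

    /* elements that do not fill a whole vector */
    for (; i < n; i++)
        tmp[0] += a[i];

    /* tree reduction: pairs are summed, then pairs of these results, etc. */
    for (size_t w = LANES / 2; w >= 1; w /= 2)
        for (size_t l = 0; l < w; l++)
            tmp[l] += tmp[l + w];

    return tmp[0];
}
```

The sequential loop of the figure needs one addition per partial sum, one after the other. The tree needs the same number of additions but only log2(LANES) dependent steps, since all additions within one level are independent.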

Loop Pushing / Loop Embedding 

This transformation replaces a call in a loop body by the loop in the called function. It is an
inter-procedural optimization. It allows the parallelization of the loop nest and eliminates the overhead
caused by the procedure call. Loop distribution can be used in conjunction with loop pushing.

for(i=0; i<N; i++)                      f2(x);
  f(x,i);

void f(int* a, int j) {                 void f2(int* a) {
  a[j] = a[j] + c;                        for(i=0; i<N; i++)
}                                           a[i] = a[i] + c;
                                        }

Figure 44: Example of loop pushing

Procedure Inlining 

This transformation replaces a call to a procedure by the code of the procedure itself. It is an
inter-procedural optimization. It allows a loop nest to be parallelized, removes overhead caused by the
procedure call, and can improve locality.

for(i=0; i<N; i++)                      for(i=0; i<N; i++)
  f(a,i);                                 a[i] = a[i] + c;

void f(int* x, int j) {
  x[j] = x[j] + c;
}

Figure 45: Example of procedure inlining

Statement Reordering 

This transformation schedules instructions of the loop body to modify the data dependence graph and
enable vectorization.

for(i=0; i<N; i++) {                    for(i=0; i<N; i++) {
  a[i] = b[i] * 2;                        c[i] = a[i-1] - 4;
  c[i] = a[i-1] - 4;                      a[i] = b[i] * 2;
}                                       }

Figure 46: Example of statement reordering

Software Pipelining

This transformation parallelizes a loop body by scheduling instructions of different instances of the
loop body. It is a powerful optimization to improve instruction-level parallelism. It can be used in
conjunction with loop unrolling. In the example below, the preload commands can be issued one after
another, each taking only one cycle. This time is just enough to request the memory areas. It is not
enough to actually load them. This takes many cycles, depending on the cache level that actually has
the data. Execution of a configuration behaves similarly. The configuration is issued in a single cycle,
waiting until all data are present. Then the configuration executes for many cycles.
Software pipelining overlaps the execution of a configuration with the preloads for the next configuration.
This way, the XPP array can be kept busy in parallel to the Load/Store unit.

Issue Cycle   Command

              XPPPreloadConfig(CFG1);
              for (i=0; i<100; ++i) {
                XPPPreload(2, a+10*i, 10);
                XPPPreload(5, b+20*i, 20);
                // delay
                XPPExecute(CFG1);
              }

Issue Cycle   Command

Prologue      XPPPreloadConfig(CFG1);
              XPPPreload(2, a, 10);
              XPPPreload(5, b, 20);
              // delay
              for (i=1; i<100; ++i) {
Kernel 1:       XPPExecute(CFG1);
       2:       XPPPreload(2, a+10*i, 10);
       3:       XPPPreload(5, b+20*i, 20);
       4:     }
Epilogue      XPPExecute(CFG1);
              // delay

Figure 47: Example of software pipelining



Vector Statement Generation 

This transformation replaces instructions by vector instructions that can perform an operation on
several data in parallel.

for(i=0; i<=N; i++)                     a[0:N] = b[0:N];
  a[i] = b[i];

Figure 48: Example of vector statement generation



3.2.3 Data-Layout Optimizations 

In the following we describe optimizations that modify the data layout in memory in order to extract
more parallelism or prevent memory problems like cache misses.

Scalar Privatization

This optimization is used in multi-processor systems to increase the amount of parallelism and avoid
unnecessary communications between the processing elements. If a scalar is only used like a temporary
variable in a loop body, then each processing element can receive a copy of it and achieve its
computations with this private copy.

for(i=0; i<=N; i++) {
  c = b[i];
  a[i] = a[i] + c;
}

Figure 49: Example for scalar privatization

Array Privatization 

This optimization is the same as scalar privatization except that it works on arrays rather than on sca- 
lars. 

Array Merging 

This optimization transforms the data layout of arrays by merging the data of several arrays following
the way they are accessed in a loop nest. This way, memory cache misses can be avoided. The layout
of the arrays can be different for each loop nest. Below is the example of a cross-filter, where the
accesses to array a are interleaved with accesses to array b. The picture next to it represents the data
layout of both arrays where blocks of a (in green) are merged with blocks of b (in yellow). Unused
memory space is in white. Thus cache misses are avoided as data blocks containing arrays a and b are
loaded into the cache when getting data from memory. More details can be found in [11].

for(i=1; i<=N-1; i++)
  for(j=1; j<=N; j++)
    b[i][j] = 0.25*(a[i-1][j] + a[i][j-1] +
                    a[i+1][j] + a[i][j+1]);

Figure 50: Example for array merging

3.2.4 Example of application of the optimizations 

As seen before, a lot of optimizations can be performed on loops before and also after generation of
vector statements. Finding a sequence of optimizations that would produce an optimal solution for all
loop nests of a program is still an area of research. Therefore we can only propose a way to use these
optimizations that follows a reasonable heuristic to produce vectorizable loop nests. To vectorize the
code, we can use the Allen-Kennedy algorithm, which uses statement reordering and loop distribution
before vector statements are generated. It can be enhanced with loop interchange, scalar expansion,
index set splitting, node splitting and loop peeling. All these transformations are based on the data
dependence graph. A statement can be vectorized if it is not part of a dependence cycle; hence
optimizations are performed to break cycles or, if not completely possible, to create loop nests without
dependence cycles.

We can divide the whole process in four major steps. First we should restructure the procedures by
analyzing the procedure calls inside the loop bodies and try to remove them. Then some high-level
dataflow optimizations are applied to the loop bodies to modify their control-flow and simplify their
code. The third step would consist in preparing the loop nests for vectorization by building perfect
loop nests and ensuring that inner loop levels are vectorizable. Then optimizations can be performed
that target the architecture and optimize the data locality. It should also be noted that other
optimizations and code transformations can occur between these different steps that can also help to further
optimize the loop nests.

Hence the first step applies procedure inlining and loop pushing to remove the procedure calls of the
loop bodies. Then the second step consists of loop-invariant code motion, loop unswitching, strength
reduction and idiom recognition. The third step can be divided in several subsets of optimizations. We
can first apply loop reversal, loop normalization and if-conversion to get normalized loop nests. This
makes it possible to build the data dependency graph. Then, if dependences prevent the loop nest from being vectorized,
transformations can be applied. For instance, if dependences occur only on certain iterations, loop
peeling or loop splitting can be applied. Node splitting, loop skewing, scalar expansion or statement
reordering can be applied in other cases. Then loop interchange moves inwards the loop levels without
dependence cycles. The goal is to have perfectly nested loops with the loop levels carrying dependence
cycles as much outwards as possible. Then we can apply loop fusion, reduction recognition, scalar
replacement/array contraction and loop distribution to further improve the following vectorization.
Vector statement generation can be performed at last using the Allen-Kennedy algorithm, for instance.
The last step can consist of optimizations like loop tiling, strip-mining, loop unrolling and software
pipelining that take into account the target processor.

The number of optimizations in the third step is large, but not all of them are applied to each loop nest.
Depending on the goal of the vectorization and on the data dependence graph, only some of them are
applied. Heuristics are used to guide the application of the optimizations, which can be applied several
times if needed. Let us illustrate this with an example.

void f(int** a, int** b, int i, int j) {
    a[i][j] = a[i][j-1] - b[i+1][j-1];
}

void g(int* a, int* c, int i) {
    a[i] = c[i] + 2;
}

for (i=0; i<N; i++) {
    for (j=1; j<9; j++)
        if (k>0)
            f(a, b, i, j);
        else
            g(d, c, j);
    d[i] = d[i+1] + 2;
}

for (i=0; i<N; i++)
    a[i][i] = b[i] + 3;

The first step will find that inlining the two procedure calls is possible; then loop unswitching can be
applied to remove the conditional instruction from the loop body. The second step begins by applying
loop normalization and analyzes the data dependence graph. A cycle can be broken by applying loop
interchange, as it is only carried by the second level. The two levels are exchanged so that the inner
level is vectorizable. Before that, or also after, we apply loop distribution. Loop fusion can be applied
when the loop on i is pulled out of the conditional instruction by a traditional redundant code
elimination optimization. Finally vector code can be generated for the resulting loops.

In more detail, after procedure inlining, we obtain:

for (i=0; i<N; i++) {
    for (j=1; j<9; j++)
        if (k>0)
            a[i][j] = a[i][j-1] - b[i+1][j-1];
        else
            d[j] = c[j] + 2;
    d[i] = d[i+1] + 2;
}

for (i=0; i<N; i++)
    a[i][i] = b[i] + 3;

After loop unswitching, we obtain: 

if (k > 0)
    for (i=0; i<N; i++) {
        for (j=1; j<9; j++)
            a[i][j] = a[i][j-1] - b[i+1][j-1];
        d[i] = d[i+1] + 2;
    }
else
    for (i=0; i<N; i++) {
        for (j=1; j<9; j++)
            d[j] = c[j] + 2;
        d[i] = d[i+1] + 2;
    }

for (i=0; i<N; i++)
    a[i][i] = b[i] + 3;

After loop normalization, we obtain:

if (k > 0)
    for (i=0; i<N; i++) {
        for (j=0; j<8; j++)
            a[i][j+1] = a[i][j] - b[i+1][j];
        d[i] = d[i+1] + 2;
    }
else
    for (i=0; i<N; i++) {
        for (j=0; j<8; j++)
            d[j] = c[j+1] + 2;
        d[i] = d[i+1] + 2;
    }

for (i=0; i<N; i++)
    a[i][i] = b[i] + 3;

After loop distribution and loop fusion, we obtain:

if (k > 0)
    for (i=0; i<N; i++)
        for (j=0; j<8; j++)
            a[i][j+1] = a[i][j] - b[i+1][j];
else
    for (i=0; i<N; i++)
        for (j=0; j<8; j++)
            d[j] = c[j+1] + 2;

for (i=0; i<N; i++) {
    d[i] = d[i+1] + 2;
    a[i][i] = b[i] + 3;
}

After loop interchange, we obtain:

if (k > 0)
    for (j=0; j<8; j++)
        for (i=0; i<N; i++)
            a[i][j+1] = a[i][j] - b[i+1][j];
else
    for (i=0; i<N; i++)
        for (j=0; j<8; j++)
            d[j] = c[j+1] + 2;

for (i=0; i<N; i++) {
    d[i] = d[i+1] + 2;
    a[i][i] = b[i] + 3;
}
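
The legality of the interchange in the k > 0 branch can be checked empirically. The following sketch is
not part of the original study: the array sizes, the fill pattern and all function names are assumptions
chosen for illustration. It runs the recurrence a[i][j+1] = a[i][j] - b[i+1][j] in both loop orders and
compares the results. Since the dependence is carried only by the j level, both orders should agree.

```c
#include <string.h>

#define N 16   /* assumed row count, illustration only */
#define M 9    /* columns 0..8, since j < 8 and a[i][j+1] is written */

/* Original order: i outer, j inner (the inner j loop carries the dependence). */
static void nest_ij(int a[N][M], int b[N + 1][M]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < 8; j++)
            a[i][j + 1] = a[i][j] - b[i + 1][j];
}

/* Interchanged order: j outer, i inner (the inner i loop carries no dependence). */
static void nest_ji(int a[N][M], int b[N + 1][M]) {
    for (int j = 0; j < 8; j++)
        for (int i = 0; i < N; i++)
            a[i][j + 1] = a[i][j] - b[i + 1][j];
}

/* Returns 1 if both loop orders leave identical contents in a. */
int interchange_is_equivalent(void) {
    static int a1[N][M], a2[N][M], b[N + 1][M];
    for (int i = 0; i <= N; i++)
        for (int j = 0; j < M; j++)
            b[i][j] = 3 * i + j;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            a1[i][j] = a2[i][j] = i - 2 * j;
    nest_ij(a1, b);
    nest_ji(a2, b);
    return memcmp(a1, a2, sizeof a1) == 0;
}
```

Such a test does not prove legality in general; it only illustrates why the inner i level can be treated
as a vector operation after the exchange.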

After vector code generation, we obtain:

if (k > 0)
    for (j=0; j<8; j++)
        a[0:N-1][j+1] = a[0:N-1][j] - b[0:N][j];
else
    for (i=0; i<N; i++)
        d[0:8] = c[1:9] + 2;

d[0:N-1] = d[1:N] + 2;
a[0:N-1][0:N-1] = b[0:N] + 3;



4 Compiler Specification for the PACT XPP



4.1 Introduction 



A cached RISC-XPP architecture exploits its full potential on code that is characterized by high data
locality and high computational effort. A compiler for this architecture has to consider these design
constraints. The compiler's primary objective is to concentrate computationally expensive calculations
in innermost loops and to provide as much data locality as possible for them.

The compiler contains the usual analyses and optimizations. As interprocedural analyses, like alias
analysis, are especially useful, a global optimization driver is necessary to ensure the propagation of
global information to all optimizations. The following sections concentrate on the way the PACT XPP
influences the compiler.



Figure 51 shows the main steps the compiler must follow to produce code for a system containing a 
RISC processor and a PACT XPP. The next sections focus on the XPP compiler itself, but first the 
other steps are briefly described. 



4.2 Compiler Structure

[Figure 51 stages: Code Preparation, Partitioning, XPP Compiler, RISC Code Gen., RISC Code Sched.]

Figure 51: Global View of the Compiling Process



4.2.1 Code Preparation 

This step takes the whole program as input and can be considered as a usual compiler front-end. It
prepares the code by applying code analyses and optimizations to enable the compiler to extract as
many loop nests as possible to be executed by the PACT XPP. Important optimizations are idiom
recognition, copy propagation, dead code elimination, and all the usual analyses like dataflow and
alias analysis.

4.2.2 Partitioning 

Partitioning decides which part of the program is executed by the host processor and which part is
executed by the PACT XPP.

A loop nest is executed by the host in three cases:

■ if the loop nest is not well-formed,

■ if the number of operations to execute is too small to make execution on the PACT XPP worthwhile, or

■ if it is impossible to get a mapping of the loop nest on the PACT XPP.

A loop nest is said to be well-formed if the loop bounds and the step of all loops are constant, the loop
induction variables are known and there is only one entry and one exit to the loop nest.

Another problem arises with loop nests where the loop bounds are constant but unknown at compile
time. Loop tiling overcomes this problem; it is described below. Nevertheless, it may not be
worthwhile to execute the loop nest on the PACT XPP if the loop bounds are too low. A conditional
instruction testing whether the loop bounds are large enough can be introduced, and two versions of the
loop nest are produced. One is executed on the host processor, and the other on the PACT XPP
when the loop bounds are suitable. This also eases the application of loop transformations, as
possible compensation code is simpler due to the hypothesis on the loop bounds.
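
The two-version scheme can be pictured as follows. This is only a sketch under stated assumptions: the
threshold MIN_XPP_TRIP_COUNT, the function names and the return code are hypothetical, and the "XPP"
version is ordinary C standing in for a configuration so that the example remains runnable.

```c
#define MIN_XPP_TRIP_COUNT 64  /* hypothetical profitability threshold */

enum target { TARGET_HOST, TARGET_XPP };

/* Version kept on the host processor. */
static void add2_host(int *d, const int *c, int n) {
    for (int i = 0; i < n; i++)
        d[i] = c[i] + 2;
}

/* Stand-in for the version mapped to the XPP. A real system would
   trigger a configuration here instead of running a C loop. */
static void add2_xpp(int *d, const int *c, int n) {
    for (int i = 0; i < n; i++)
        d[i] = c[i] + 2;
}

/* n is constant during the loop but unknown at compile time, so the
   choice between the two versions is made by a run-time test. */
enum target add2(int *d, const int *c, int n) {
    if (n >= MIN_XPP_TRIP_COUNT) {
        add2_xpp(d, c, n);
        return TARGET_XPP;
    }
    add2_host(d, c, n);
    return TARGET_HOST;
}
```

Inside the XPP version the compiler may rely on n being at least the threshold, which is the hypothesis
on the loop bounds that simplifies compensation code.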



4.2.3 RISC Code Generation and Scheduling

After the XPP compiler has produced NML code for the loops chosen by the partitioning phase, the
main compiling process must handle the code that will be executed by the host processor, where
instructions to manage the configurations have been inserted. This is the aim of the last two steps:

■ RISC Code Generation and

■ RISC Code Scheduling.

The first one produces code for the host processor and the second one optimizes it further by looking
for a better schedule, using software pipelining for instance.



4.3 XPP Compiler for Loops 

Figure 52 describes the internal processing of the XPP Compiler. It is a complex cooperation between
program transformations, included in the XPP Loop Optimizations, a temporal partitioning phase,
NML code generation and the mapping of the configuration on the PACT XPP.



Figure 52: Detailed Architecture of the XPP Compiler



First, loop optimizations targeted at the PACT XPP are applied to try to produce innermost loop bodies
that can be executed on the array of processors. If this is the case, the NML code generation phase is
called; if not, temporal partitioning is applied to get several configurations for the same loop. After
NML code generation and the mapping phase, it can also happen that a configuration does not fit on the
PACT XPP. In this case the loop optimizations are applied again with respect to the reasons for the
failure of the NML code generation or of the mapping. If this new application of loop optimizations
does not change the code, temporal partitioning is applied. Furthermore, we keep track of the number of
attempts for the NML code generation and the mapping; if too many attempts are made and we still do
not obtain a solution, we abort the process, and the loop nest is executed by the host processor.



4.3.1 Temporal Partitioning 

Temporal partitioning splits the code generated for the PACT XPP into several configurations if the
number of operations to be executed in a loop nest, i.e. the size of the configuration, exceeds the
number of operations executable in a single configuration. This transformation is called loop dissevering
[6]. These configurations are then integrated into a loop of configurations whose number of executions
corresponds to the iteration range of the original loop.

4.3.2 Generation of NML Code 

This step takes as input an intermediate form of the code produced by the XPP Loop Optimizations
step, together with a dataflow graph built upon it. NML code can then be produced by using tree- or
DAG-pattern matching techniques.



4.3.3 Mapping Step 

This step takes care of mapping the NML modules on the PACT XPP by placing the operations on the
ALUs, FREGs, and BREGs, and routing the data through the buses.



4.4 XPP Loop Optimizations Driver 

The loop optimizations used for the PACT XPP are now described. Their goal is to extract as much
parallelism as possible from the loop nests in order to execute them on the PACT XPP by exploiting
the ALU-PAEs as effectively as possible and to avoid memory bottlenecks with the IRAMs. The
following sections explain how they are organized and how the architecture is taken into account when
applying the optimizations.

4.4.1 Organization of the System 

Figure 53 below presents the organization of the loop optimizations. The transformations are divided
into seven groups. Other standard optimizations and analyses are applied in between. Each group can be
called several times. Loops over several groups can also occur if needed. The number of iterations for
each driver loop can be a constant value or determined at compile time by the optimizations themselves
(e.g. repeat until a certain code quality is reached). In the first iteration of the loop, it can be checked
whether loop nests are usable for the PACT XPP; this check is mainly directed at the loop bounds etc.
For instance, if the loop nest is well-formed and the data dependence graph does not prevent
optimization, but the loop bounds are unknown, then in the first iteration loop tiling is applied to get an
innermost loop that is easier to handle and can be better optimized, and in the second iteration, loop
normalization, if-conversion, loop interchange and other optimizations can be applied to effectively
optimize the innermost loops for the PACT XPP. Nevertheless this has not been necessary until now
with the examples presented in the next chapters.

Group I ensures that no procedure calls occur in the loop nest. Group II prepares the loop bodies by
removing loop-invariant instructions and conditional instructions to ease the analysis. Group III
generates loop nests suitable for the data dependence analysis. Group IV contains optimizations to
transform the loop nests to get data dependence graphs that are suitable for vectorization. Group V
contains optimizations that ensure that the innermost loops can be executed on the PACT XPP. Group VI
contains optimizations that further extract parallelism from the loop bodies. Group VII contains
optimizations more towards optimizing the usage of the hardware itself.

In each group the application of the optimizations depends on the result of the analysis and the
characteristics of the loop nest. For instance, it is clear that not all transformations in Group IV are
applied. It depends on the data dependence graph computed before.



Group I
    procedure inlining
    loop pushing

Group II
    loop-invariant code motion
    loop unswitching
    strength reduction
    induction variable elimination

Group III
    loop reversal
    loop normalization
    if-conversion

Group IV
    loop peeling
    loop splitting
    node splitting
    loop skewing
    scalar expansion
    statement reordering

Group V
    loop interchange
    loop distribution
    loop collapsing
    loop tiling
    strip-mining
    loop alignment

Group VI
    loop fusion
    reduction recognition
    scalar replacement
    loop unrolling/unroll-and-jam

Group VII
    data duplication
    shift register synthesis
    loop pipelining
    tree balancing

wo 2004/01556^^ PCT/EP2003/008080 

Figure 53: Detailed View of the XPP Loop Optimizations



4.4.2 Loop Preparation

The optimizations of Groups I, II and III of the XPP compiler generate loop bodies without procedure
calls, conditional instructions and induction variables other than loop control variables. Thus loop
nests where the innermost loops are suitable for execution on the PACT XPP are obtained. The
iteration ranges are normalized to ease data dependence analysis and the application of other code
transformations.



4.4.3 Transformation of the Data Dependence Graph 

The optimizations of Group IV are performed to obtain innermost loops suitable for vectorization with
respect to the data dependence graph. Nevertheless, a difference from usual vectorization is that a
dependence cycle, which would normally prevent any vectorization of the code, does not necessarily
prevent the optimization of a loop nest for the PACT XPP. If a cycle is due to an anti-dependence, then
it may not prevent optimization of the code, as stated in [7]. Furthermore, a dependence cycle will not
prevent vectorization for the PACT XPP when it consists only of a loop-carried true dependence on the
same expression. If cycles with distance k occur in the data dependence graph, then this can be
handled by holding values in registers. This optimization is of the same class as cycle shrinking.

Nevertheless, limitations due to the dependence graph exist. Loop nests cannot be handled if some
dependence distances are not constant or unknown. If only a few dependences prevent the optimization
of the whole loop nest, this can be overcome by using the traditional vectorization algorithm
that topologically sorts the strongly connected components of the data dependence graph (statement
reordering) and then applies loop distribution. This way, some loop nests are obtained that can be
handled by the PACT XPP and some that are handled by the host processor.
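
To illustrate how a loop-carried true dependence with a constant distance can be served from registers,
consider the sketch below. It is an assumption-laden illustration, not code from the study: the
recurrence, the distance k = 2 and the function names are made up, and the scalars r0 and r1 merely
stand in for hardware registers.

```c
/* Reference form: a[i] = a[i-2] + b[i] reads a value written two
   iterations earlier (loop-carried true dependence, distance 2). */
void recur_mem(int *a, const int *b, int n) {
    for (int i = 2; i < n; i++)
        a[i] = a[i - 2] + b[i];
}

/* Same computation with the two live values held in the scalars r0
   and r1; inside the loop a[] is only written, never read back. */
void recur_reg(int *a, const int *b, int n) {
    if (n < 2)
        return;
    int r0 = a[0];  /* holds a[i-2] */
    int r1 = a[1];  /* holds a[i-1] */
    for (int i = 2; i < n; i++) {
        int v = r0 + b[i];
        a[i] = v;
        r0 = r1;    /* shift the window by one iteration */
        r1 = v;
    }
}
```

The number of scalars equals the dependence distance, which is why this only works when the distance
is constant and known.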

4.4.4 Influence of the Architectural Parameters 

Some hardware-specific parameters influence the application of the loop transformations. The number
of operations and memory accesses that a loop body performs is estimated at each step. These
parameters influence loop unrolling, strip-mining, loop tiling and also loop interchange (iteration range).

The table below lists the parameters that influence the application of the optimizations. For each of
them two values are given: a starting value computed from the loop, and a restriction value which is the
value the parameter should reach or should not exceed after the application of the optimizations.
Vector length depicts the range of the innermost loops, i.e. the number of elements of an array accessed
in the loop body. Reused data set size represents the amount of data that must fit in the cache. I/O
IRAMs, ALU, FREG, BREG stand for the number of IRAMs, ALUs, FREGs, and BREGs respectively
that constitute the PACT XPP. The dataflow graph width represents the number of operations
that can be executed in parallel in the same pipeline stage. The dataflow graph height represents the
length of the pipeline. Configuration cycles amounts to the length of the pipeline plus the number of
cycles dedicated to the control. The application of each optimization may

■ decrease a parameter's value (-),

■ increase a parameter's value (+),

■ not influence a parameter (id), or

■ adapt a parameter's value to fit into the goal size (make fit).

Furthermore, some resources must be kept for control in the configuration; this means that the
optimizations should not make the needs exceed 70-80% of each resource.



Parameter                 Goal                        Starting Value
Vector length             IRAM size (256 words)       Loop count
Reused data set size      Approx. cache size          Algorithm analysis/loop sizes
I/O IRAMs                 PACT size (16)              Algorithm inputs + outputs
ALU                       PACT size (< 64)            ALU opcode estimate
BREG                      PACT size (< 80)            BREG opcode estimate
FREG                      PACT size (< 80)            FREG opcode estimate
Data flow graph width     High                        Algorithm data flow graph
Data flow graph height    Small                       Algorithm data flow graph
Configuration cycles      ≤ command line parameter    Algorithm analysis



Here are some additional notations used in the following descriptions. Let n be the total number of
processing elements available, r the width of the dataflow graph, in the maximum number of input
values in a cycle and out the maximum number of output values possible in a cycle. On the PACT
XPP, n is the number of ALUs, FREGs and BREGs available for a configuration, r is the number of
ALUs, FREGs and BREGs that can be started in parallel in the same pipeline stage, and in and out
amount to the number of available IRAMs. As IRAMs have 1 input port and 1 output port, the number
of IRAMs yields directly the number of input and output data.

The number of operations of a loop body is computed by adding all logic and arithmetic operations
occurring in the instructions. The number of input values is the number of operands of the instructions
regardless of address operations. The number of output values is the number of output operands of the
instructions regardless of address operations. To determine the number of parallel operations, input
and output values, the dataflow graph must be considered. The effects of each transformation on
the architectural parameters are now presented in detail.

Loop interchange 

Loop interchange is applied when the innermost loop has too narrow an iteration range. In that case,
loop interchange yields an innermost loop with a more profitable iteration range. It can also be
influenced by the layout of the data in memory. It can be profitable for data locality to interchange two
loops to get a more practical way to access arrays in the cache and therefore prevent cache misses. It is
of course also influenced by data dependences, as explained earlier.

Parameter                 Effect
Vector length             +
Reused data set size      make fit
I/O IRAMs                 id
ALU                       id
BREG                      id
FREG                      id
Data flow graph width     id
Data flow graph height    id
Configuration cycles





Loop Distribution 

Loop distribution is applied if a loop body is too big to fit on the PACT XPP. Its main effect is to
reduce the processing elements needed by the configuration. Reducing the need for IRAMs can only be
a side effect.

Parameter                 Effect
Vector length             id
Reused data set size      id
I/O IRAMs                 make fit
ALU                       make fit
BREG                      make fit
FREG                      make fit
Data flow graph width
Data flow graph height
Configuration cycles





Loop Collapsing 



Loop collapsing can be used to make the loop body use more memory resources. As several
dimensions are merged, the iteration range is increased and the memory needed is increased as well.

Parameter                 Effect
Vector length             +
Reused data set size
I/O IRAMs                 +
ALU                       id
BREG                      id
FREG                      id
Data flow graph width     +
Data flow graph height    +
Configuration cycles      +



Loop Tiling

Loop tiling, as multi-dimensional strip-mining, is influenced by all parameters. It is especially useful
when the iteration space is by far too big to fit in the IRAM, or to guarantee a maximum execution time
when the iteration space is unbounded (see Section 4.4.6). It can then make the loop body fit with
respect to the resources of the PACT XPP, namely the IRAM and cache line sizes. The size of the tiles
for strip-mining and loop tiling can be computed like this:

tile size = resources available for the loop body / resources necessary for the loop body

The resources available for the loop body are the whole resources of the PACT XPP for this
configuration. A tile size can be computed for the data and another one for the processing elements; the
final tile size is then the minimum of these two. For instance, when the amount of data accessed is
larger than the capacity of the cache, loop tiling can be applied as shown below.

for(i=0; i<=1048576; i++)        for(i=0; i<=1048576; i+=CACHE_SIZE)
    <loop body>                      for(j=0; j<CACHE_SIZE; j+=IRAM_SIZE)
                                         for(k=0; k<IRAM_SIZE; k++)
                                             <tiled loop body>



Figure 54: Example of loop tiling for the PACT XPP

Parameter                 Effect
Vector length             make fit
Reused data set size      make fit
I/O IRAMs                 id
ALU                       id
BREG                      id
FREG                      id
Data flow graph width     +
Data flow graph height
Configuration cycles
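
The tile size rule for loop tiling (available resources divided by the resources one iteration needs,
taking the minimum over data and processing elements) can be written down directly. The sketch below
is illustrative only: the function names and the way resources are counted are assumptions, and all
"needed" arguments are assumed to be positive.

```c
/* tile size = resources available / resources needed per iteration.
   Assumes needed_per_iteration > 0. */
static int tile_size(int available, int needed_per_iteration) {
    return available / needed_per_iteration;
}

/* Final tile size: the minimum of the data-driven tile size and the
   tile size driven by the processing elements. */
int final_tile_size(int data_avail, int data_needed,
                    int pe_avail, int pe_needed) {
    int t_data = tile_size(data_avail, data_needed);
    int t_pe = tile_size(pe_avail, pe_needed);
    return t_data < t_pe ? t_data : t_pe;
}

/* Tiled traversal of n elements. The inner loop covers one tile and is
   clamped at the end, so n need not be a multiple of tile. */
long sum_tiled(const int *x, int n, int tile) {
    long s = 0;
    for (int ii = 0; ii < n; ii += tile) {
        int end = ii + tile < n ? ii + tile : n;
        for (int i = ii; i < end; i++)
            s += x[i];
    }
    return s;
}
```

For example, with 256 IRAM words and 2 words per iteration the data allows 128 iterations per tile,
while 64 processing elements at 4 per iteration allow only 16, so the final tile size would be 16.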





Strip-Mining 

Strip-mining is used to make the amount of memory accesses of the innermost loop fit with the
IRAMs' capacity. The processing elements do not usually represent a problem, as the PACT XPP has
64 ALU-PAEs, which should be sufficient to execute any single loop body. Nevertheless, the number
of operations can also be taken into account in the same way as the data.

Parameter                 Effect
Vector length             make fit
Reused data set size      id
I/O IRAMs
ALU                       id
BREG                      id
FREG                      id
Data flow graph width
Data flow graph height
Configuration cycles
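
A strip-mined loop can be sketched as below. The example is an assumption for illustration: the loop
body d[i] = c[i] + 2 is borrowed from the earlier example, IRAM_WORDS reflects the 256-word IRAM size
from the parameter table, and the function name and the returned strip count are hypothetical.

```c
#define IRAM_WORDS 256  /* IRAM capacity in words, as in the parameter table */

/* Strip-mined d[i] = c[i] + 2: the outer loop walks over strips, the
   inner loop touches at most IRAM_WORDS elements of each array.
   Returns the number of strips that were executed. */
int add2_strip_mined(int *d, const int *c, int n) {
    int strips = 0;
    for (int is = 0; is < n; is += IRAM_WORDS) {
        int len = n - is < IRAM_WORDS ? n - is : IRAM_WORDS;
        for (int i = 0; i < len; i++)
            d[is + i] = c[is + i] + 2;
        strips++;
    }
    return strips;
}
```

Only the inner loop would be a candidate for a configuration; the outer loop over strips stays with the
host and reloads the IRAMs between strips.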





Loop Fusion 

Loop fusion is applied when a loop body does not use enough resources. In this case several loop
bodies can be merged to obtain a configuration using a larger part of the available resources.

Parameter                 Effect
Vector length             id
Reused data set size      id
I/O IRAMs                 +
ALU                       +
BREG                      +
FREG                      +
Data flow graph width     id
Data flow graph height    +
Configuration cycles      +



Scalar Replacement 

The amount of memory needed by the loop body should always fit in the IRAMs. Thanks to this
optimization, some input or output data represented by array references, which would otherwise be
stored in IRAMs, are replaced by scalars that are either stored in FREGs or kept on buses.

Parameter                 Effect
Vector length             +
Reused data set size      id
I/O IRAMs                 id
ALU                       id
BREG                      id
FREG                      id
Data flow graph width     +
Data flow graph height
Configuration cycles      id
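
As a small illustration of scalar replacement, the sketch below replaces a repeatedly accessed array
element by a scalar. The row-sum kernel, the constant COLS and the function names are assumptions
made for the example; on the PACT XPP the scalar would live in an FREG or on a bus rather than in a
C variable.

```c
#define COLS 4  /* assumed row length, illustration only */

/* Before: a[i] is read and written in memory on every inner iteration. */
void row_sums_mem(int *a, int b[][COLS], int rows) {
    for (int i = 0; i < rows; i++) {
        a[i] = 0;
        for (int j = 0; j < COLS; j++)
            a[i] = a[i] + b[i][j];
    }
}

/* After scalar replacement: the array reference a[i] becomes the
   scalar s, and a[i] is stored once per row. */
void row_sums_scalar(int *a, int b[][COLS], int rows) {
    for (int i = 0; i < rows; i++) {
        int s = 0;
        for (int j = 0; j < COLS; j++)
            s += b[i][j];
        a[i] = s;
    }
}
```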



Loop Unrolling 

Loop unrolling, loop collapsing, loop fusion and loop distribution are influenced by the number of
operations of the body of the loop nest and the number of data inputs and outputs of these operations,
as they modify the size of the loop body. The number of operations should always be smaller than n,
and the number of input and output data should always be smaller than in and out.

Parameter                 Effect
Vector length             id
Reused data set size      id
I/O IRAMs                 +
ALU                       +
BREG                      +
FREG                      +
Data flow graph width     id
Data flow graph height
Configuration cycles      +



Unroll-and-Jam

Unroll-and-jam consists in unrolling an outer loop and then merging the inner loops. It must compute
the unrolling degree u with respect to the number of input memory accesses m and output memory
accesses p in the inner loop. The following inequality must hold: u * m ≤ in ∧ u * p ≤ out. Moreover,
the number of operations of the new inner loop must also fit on the PACT XPP.

Parameter                 Effect
Vector length             id
Reused data set size      +
I/O IRAMs                 +
ALU                       +
BREG                      +
FREG                      +
Data flow graph width     id
Data flow graph height    +
Configuration cycles      +
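
The constraint on the unrolling degree can be turned into a small helper, shown here together with an
unroll-and-jam by a factor of two. This is a sketch under assumptions: the function names, the kernel
y = 2 * x and the constant W are invented for illustration, m and p are assumed to be positive, and the
additional check on the number of operations is left out.

```c
/* Largest unrolling degree u with u*m <= in and u*p <= out, where m and
   p are the input and output memory accesses of the inner loop body.
   Assumes m > 0 and p > 0. A result of 0 means the body does not fit. */
int max_unroll_degree(int m, int p, int in, int out) {
    int u_in = in / m;
    int u_out = out / p;
    return u_in < u_out ? u_in : u_out;
}

#define W 8  /* assumed inner trip count, illustration only */

/* Original loop nest. */
void scale(int y[][W], int x[][W], int rows) {
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < W; j++)
            y[i][j] = 2 * x[i][j];
}

/* Unroll-and-jam with u = 2: the outer loop is unrolled twice and the
   two copies of the inner loop are merged. rows is assumed to be even. */
void scale_uj2(int y[][W], int x[][W], int rows) {
    for (int i = 0; i < rows; i += 2)
        for (int j = 0; j < W; j++) {
            y[i][j] = 2 * x[i][j];
            y[i + 1][j] = 2 * x[i + 1][j];
        }
}
```

With 16 IRAMs available on each side, a body with three input accesses and one output access would
allow u = 5, since 5 * 3 = 15 ≤ 16 while 6 * 3 = 18 exceeds the limit.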



4.4.5 Optimizations Towards Hardware Improvements

At this step other optimizations, specific to the PACT XPP, can be made. These optimizations deal 
mostly with memory problems and dataflow considerations. This is the case of shift register synthesis, 
input data duplication (similar to scalar privatization), or loop pipelining. 

Shift Register Synthesis 

This optimization deals with array accesses that occur during the execution of a loop body. When
several values of an array are alive for different iterations, it can be convenient to store them in registers
rather than accessing memory each time they are needed. As the same value must be stored in different
registers depending on the number of iterations it is alive, a value shares several registers and flows
from one register to another at each iteration. It is similar to a vector register allocated to an array access
with the same value for each element. This optimization is performed directly on the dataflow graph
by inserting nodes representing registers when a value must be stored in a register. In the PACT XPP,
it amounts to storing it in a data register. A detailed explanation can be found in [1].

Shift register synthesis is mainly suitable for small to medium numbers of iterations during which values
are alive. Since the pipeline length increases with each iteration for which the value has to be buffered,
the following method is better suited for medium to large distances between accesses in one input array.

Nevertheless, this method works very well for image processing algorithms, which mostly alter a pixel
by analyzing it and its surrounding neighbors.



Parameter                 Effect
Vector length             +
Reused data set size      id
I/O IRAMs                 id
ALU                       id
BREG                      id
FREG                      id
Data flow graph width     +
Data flow graph height
Configuration cycles      id
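
The effect of shift register synthesis can be imitated in C with a chain of scalars. The following
sketch is an illustration under assumptions, not the study's implementation: the three-point window
sum, the function names and the scalars r0, r1, r2 (standing in for data registers) are hypothetical,
and x is assumed to hold n + 2 elements.

```c
/* Reference: three memory reads of x per iteration. */
void window3_mem(int *y, const int *x, int n) {
    for (int i = 0; i < n; i++)
        y[i] = x[i] + x[i + 1] + x[i + 2];
}

/* Shift-register form: each x value is alive for three iterations.
   It is loaded once and then flows r2 -> r1 -> r0. */
void window3_shift(int *y, const int *x, int n) {
    if (n <= 0)
        return;
    int r0 = x[0];
    int r1 = x[1];
    for (int i = 0; i < n; i++) {
        int r2 = x[i + 2];   /* the only memory read per iteration */
        y[i] = r0 + r1 + r2;
        r0 = r1;             /* values move from register to register */
        r1 = r2;
    }
}
```

The length of the register chain grows with the number of iterations a value stays alive, which matches
the remark that the technique suits small to medium distances.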



Input Data Duplication 

This optimization is orthogonal to shift register synthesis. If different elements of the same array are
needed concurrently, instead of storing the values in registers, the same values are copied into different
IRAMs. The advantages over shift register synthesis are the shorter pipeline length, and therefore the
increased parallelism, and the unrestricted applicability. On the other hand, the cache-IRAM
bottleneck can affect the performance of this solution, depending on the amounts of data to be moved.
Nevertheless, we assume that cache-IRAM transfers are negligible compared to transfers in the rest of
the memory hierarchy.

Parameter                 Effect
Vector length             +
Reused data set size      id
I/O IRAMs                 id
ALU                       id
BREG                      id
FREG                      id
Data flow graph width     +
Data flow graph height
Configuration cycles      id



Loop Pipelining 

This optimization consists in synchronizing operations by inserting delays in the dataflow graph. 
These delays are registers. For the PACT XPP, it amounts to store values in data registers to delay the 
operation using them. This is the same as pipeline balancing performed by xmap. 



Parameter                 Effect
Vector length             +
Reused data set size      id
I/O IRAMs                 id
ALU                       id
BREG                      id
FREG                      id
Data flow graph width     +
Data flow graph height
Configuration cycles      +



Tree Balancing 



This optimization consists in balancing the tree representing the loop body. It reduces the depth of the
pipeline, thus reducing the execution time of an iteration, and increases parallelism.

Parameter                 Effect
Vector length             +
Reused data set size      id
I/O IRAMs                 id
ALU                       id
BREG                      id
FREG                      id
Data flow graph width     +
Data flow graph height
Configuration cycles
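
Tree balancing can be pictured with a sum of four operands. The sketch below is only an illustration:
the function names are invented, and the depth helpers assume a tree of two-input adders over n leaves.
Reassociating the additions is valid here because integer addition is associative as long as no
overflow occurs.

```c
/* Chain ((a+b)+c)+d: three dependent additions, depth 3. */
int sum4_chain(int a, int b, int c, int d) {
    return ((a + b) + c) + d;
}

/* Balanced (a+b)+(c+d): the two inner additions are independent and
   can run in the same pipeline stage, depth 2. */
int sum4_balanced(int a, int b, int c, int d) {
    return (a + b) + (c + d);
}

/* Depth of a linear chain of two-input adders over n leaves. */
int chain_depth(int n) {
    return n > 1 ? n - 1 : 0;
}

/* Depth of a balanced tree of two-input adders over n leaves,
   i.e. the base-2 logarithm of n rounded up. */
int balanced_depth(int n) {
    int depth = 0;
    for (int width = 1; width < n; width *= 2)
        depth++;
    return depth;
}
```

For eight operands the chain needs seven stages while the balanced tree needs three, at the cost of a
wider dataflow graph.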





4.4.6 Limiting the Execution Time of a Configuration 

The execution time of a configuration must be controlled. This is ensured in the compiler by
strip-mining and loop tiling, which take care that no more input data than the IRAMs' capacity comes into
the PACT XPP in a cycle. This way the iteration range of the innermost loop that is executed on the PACT
XPP is limited, and therefore its execution time. Moreover, partitioning ensures that only loops whose
execution count can be computed at run time are going to be executed on the PACT XPP. This condition
is trivial for for-loops, but for while-loops, where the execution count cannot be determined statically,
a transformation like the one sketched below can be applied. As a result, the inner for-loop can be
handled by the PACT XPP.

while (ok) {              while (ok)
    <loop body>               for (i=0; i<100 && ok; i++) {
}                                 <loop body>
                              }



Figure 55: Transformation of while-loops
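
A runnable variant of the transformation in Figure 55 is sketched below. The countdown loop body, the
function names and the chunk counter are assumptions made for illustration; only the bound of 100
iterations is taken from the figure.

```c
#define CHUNK 100  /* bound of the inner for-loop, as in Figure 55 */

/* Original form: the trip count is not known before the loop runs. */
int drain_while(int n) {
    int steps = 0;
    int ok = n > 0;
    while (ok) {
        n--;            /* stands for <loop body> */
        steps++;
        ok = n > 0;
    }
    return steps;
}

/* Transformed form: each entry into the inner for-loop executes at most
   CHUNK iterations. *chunks counts how often the inner loop is entered. */
int drain_chunked(int n, int *chunks) {
    int steps = 0;
    int ok = n > 0;
    *chunks = 0;
    while (ok) {
        (*chunks)++;
        for (int i = 0; i < CHUNK && ok; i++) {
            n--;
            steps++;
            ok = n > 0;
        }
    }
    return steps;
}
```

Both forms perform the same number of body executions; the second one merely bounds the time spent
in any single run of the inner loop.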
5 Case Studies



5.1 3x3 Edge Detector 



5.1.1 Original Code 



Source Code: 

#include <stdio.h>

#define VERLEN 16
#define HORLEN 16

int main() {
    int v, h, inp;
    int p1[VERLEN][HORLEN];
    int p2[VERLEN][HORLEN];
    int htmp, vtmp, sum;

    for(v=0; v<VERLEN; v++)                     // loop nest 1
        for(h=0; h<HORLEN; h++) {
            scanf("%d", &p1[v][h]);             // read input pixels to p1
            p2[v][h] = 0;                       // initialize p2
        }

    for(v=0; v<=VERLEN-3; v++) {                // loop nest 2
        for(h=0; h<=HORLEN-3; h++) {
            htmp = (p1[v+2][h] - p1[v][h]) +
                   (p1[v+2][h+2] - p1[v][h+2]) +
                   2 * (p1[v+2][h+1] - p1[v][h+1]);
            if (htmp < 0)
                htmp = - htmp;

            vtmp = (p1[v][h+2] - p1[v][h]) +
                   (p1[v+2][h+2] - p1[v+2][h]) +
                   2 * (p1[v+1][h+2] - p1[v+1][h]);
            if (vtmp < 0)
                vtmp = - vtmp;

            sum = htmp + vtmp;
            if (sum > 255)
                sum = 255;
            p2[v+1][h+1] = sum;
        }
    }

    for(v=0; v<VERLEN; v++)                     // loop nest 3
        for(h=0; h<HORLEN; h++)
            printf("%d\n", p2[v][h]);           // print output pixels from p2
}

5.1.2 Preliminary Transformations

Interprocedural Optimizations 

The first step normally invokes interprocedural transformations like function inlining and loop
pushing. Since no procedure calls are within the loop body, these transformations are not applied to this
example.

Partitioning 

The partitioning algorithm chooses which code runs on the RISC processor and which code runs on
the XPP. Since we only consider inner loops to run on the XPP, the basic blocks are annotated with the
loop nest depth. Thus basic blocks which are not in a loop are separated out. Furthermore, function
calls within a loop body prevent a loop from being considered for running on the XPP.

In our benchmark the loop nests 1 and 3 are marked to run on the RISC host because of the function
call. In the following sections they are not considered any further.

Note that at this compilation stage it is not predictable whether the remaining loop nests can be
synthesized for the XPP. We have just separated out the ones which definitely cannot run on it; others
may follow, since running the code on the RISC CPU is always the fallback in our strategy.



Loop Analysis and Normalization 

The code above already has normalized loops. Nevertheless it is more likely that human-written code
would look like

for(v=1; v < VERLEN - 1; v++) {
    for(h=1; h < HORLEN - 1; h++) {
        htmp = (p1[v+1][h-1] - p1[v-1][h-1]) +
               (p1[v+1][h+1] - p1[v-1][h+1]) +
               2 * (p1[v+1][h] - p1[v-1][h]);
        if (htmp < 0)
            htmp = - htmp;

        vtmp = (p1[v-1][h+1] - p1[v-1][h-1]) +
               (p1[v+1][h+1] - p1[v+1][h-1]) +
               2 * (p1[v][h+1] - p1[v][h-1]);
        if (vtmp < 0)
            vtmp = - vtmp;

        sum = htmp + vtmp;
        if (sum > 255)
            sum = 255;
        p2[v][h] = sum;
    }
}

Although a human reader sees it at first sight, it is not obvious for the compiler that the loop is
well formed. Therefore the compiler tries to normalize the loop.

If the original loop induction variable is called i with the increment value s and lower and upper loop
bounds l and u, respectively, then the normalized loop with the induction variable i' and the upper
bound u' (the lower bound l' is 0 by definition) is obtained as follows:

■ The upper bound calculates to u' = (u-l)/s.

■ All occurrences of i are replaced by l + i' * s.

Applied to the code above, the loop statement for (v=1; v < VERLEN - 1; v++) with the
lower bound vl = 1, the upper bound vu = 14 ( < 15 means <= 14 in integer arithmetic) and the
increment vs = 1 transforms to

for(vn=0; vn <= (vu - vl)/vs; vn++)

or simplified

for(vn=0; vn <= 13; vn++)

The h-loop is transformed in the same way, yielding the original code.
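
The normalization rule can be checked with a small sketch. The function names below are hypothetical and not part of the described compiler; the sketch assumes a positive increment s and l <= u, and simply records the values the induction variable takes in both loop forms.

```c
#include <stddef.h>

/* Original loop form: for (i = l; i <= u; i += s).
   Stores the visited values of i in out and returns how many were stored. */
size_t original_loop(int l, int u, int s, int *out, size_t cap)
{
    size_t n = 0;
    for (int i = l; i <= u && n < cap; i += s)
        out[n++] = i;
    return n;
}

/* Normalized form: i' runs from 0 to u' = (u - l) / s in steps of 1,
   and every use of i is replaced by l + i' * s.
   Assumes s > 0 and l <= u. */
size_t normalized_loop(int l, int u, int s, int *out, size_t cap)
{
    size_t n = 0;
    int upper = (u - l) / s;                /* integer division */
    for (int in = 0; in <= upper && n < cap; in++)
        out[n++] = l + in * s;
    return n;
}
```

For the v-loop of the example (l = 1, u = 14, s = 1) both forms visit the same 14 values, which corresponds to the simplified bound vn <= 13.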

Idiom Recognition 

In the second step idiom recognition finds the abs() and min() structures in the loop body. Please note
that although the XPP has no abs opcode, it can easily be synthesized and should therefore be
produced to simplify the internal representation (otherwise if-conversion has to handle this case, which
increases the complexity).

Therefore the code after idiom recognition looks like (abs() and min() are compiler known functions
which are directly mapped to XPP opcodes or predefined NML modules)

for(v=0; v<=16-3; v++) {
    for(h=0; h<=16-3; h++) {
        htmp = (p1[v+2][h]   - p1[v][h]) +
               (p1[v+2][h+2] - p1[v][h+2]) +
               2 * (p1[v+2][h+1] - p1[v][h+1]);
        htmp = abs(htmp);

        vtmp = (p1[v][h+2]   - p1[v][h]) +
               (p1[v+2][h+2] - p1[v+2][h]) +
               2 * (p1[v+1][h+2] - p1[v+1][h]);
        vtmp = abs(vtmp);

        sum = min(htmp + vtmp, 255);
        p2[v+1][h+1] = sum;
    }
}
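
That the idiom form computes the same value as the conditional form can be illustrated with a minimal sketch. The helper names are hypothetical, and min_int stands in for the compiler known min(), since standard C has no integer min function.

```c
#include <stdlib.h>   /* abs */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Conditional form, as written in the original loop body. */
int clamp_sum_branches(int htmp, int vtmp)
{
    if (htmp < 0) htmp = -htmp;
    if (vtmp < 0) vtmp = -vtmp;
    int sum = htmp + vtmp;
    if (sum > 255) sum = 255;
    return sum;
}

/* Idiom form after recognition: two abs() and one min(). */
int clamp_sum_idioms(int htmp, int vtmp)
{
    return min_int(abs(htmp) + abs(vtmp), 255);
}
```

The idiom form contains no control flow, which is what allows it to be mapped to a pure data flow pipeline without if-conversion.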

Dependency Analysis 

for(v=0; v<=16-3; v++) {
    for(h=0; h<=16-3; h++) {
S1      htmp = (p1[v+2][h]   - p1[v][h]) +
               (p1[v+2][h+2] - p1[v][h+2]) +
               2 * (p1[v+2][h+1] - p1[v][h+1]);
S2      htmp = abs(htmp);
S3      vtmp = (p1[v][h+2]   - p1[v][h]) +
               (p1[v+2][h+2] - p1[v+2][h]) +
               2 * (p1[v+1][h+2] - p1[v+1][h]);
S4      vtmp = abs(vtmp);
S5      sum = min(htmp + vtmp, 255);
S6      p2[v+1][h+1] = sum;
    }
}

Figure 56: The expression tree of the edge 3x3 inner loop body

There are no loop carried dependencies which prevent pipeline vectorization. The loop independent
scalar dependencies do not prevent pipeline vectorization since the transformation does not disturb the
order of reads and writes. Furthermore forward expression substitution / dead code elimination will
remove the scalars completely.



5.1.3 Pre Code Generation Transformations

Forward Expression Substitution / Dead Code Elimination

The lack of uses of htmp, vtmp and sum after the loop nest allows forward expression substitution
along with dead code elimination to place the whole calculation into one statement.

p2[v+1][h+1] = min( abs( (p1[v+2][h]   - p1[v][h]) +
                         (p1[v+2][h+2] - p1[v][h+2]) +
                         2 * (p1[v+2][h+1] - p1[v][h+1]) )
                  + abs( (p1[v][h+2]   - p1[v][h]) +
                         (p1[v+2][h+2] - p1[v+2][h]) +
                         2 * (p1[v+1][h+2] - p1[v+1][h]) ), 255);

The scalar accesses then disappear completely.

The array accesses are mapped to IRAMs. At this stage the IRAM numbers are chosen arbitrarily; the
actual mapping to XPP IRAMs is done later.

Therefore we rename p1[v+x][h+y] and p2[v+x][h+y] to iramN[y] (e.g. p1[v+2][h] to iram2[0]). The
code then reads

iram3[1] = min( abs( (iram2[0] - iram0[0]) +
                     (iram2[2] - iram0[2]) +
                     2 * (iram2[1] - iram0[1]) ) +
                abs( (iram0[2] - iram0[0]) +
                     (iram2[2] - iram2[0]) +
                     2 * (iram1[2] - iram1[0]) ), 255);

Tree Balancing 

The visualized expression tree in Figure 56 shows another valuable optimization before matching the
tree. Since the depth of the tree determines the length of the synthesized pipeline, another
simplification can decrease this depth. In both of the main sub trees the operands of the commutative
add expressions can be interchanged to reduce the overall tree depth.

Figure 57: One of the sub trees before and after balancing. The numbers represent the annotated
maximum tree depth from the node to its deepest child leaf node

The resulting expression tree is shown in Figure 57.
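
The effect of balancing on the tree depth can be illustrated with a generic sketch. It does not reproduce the sub tree of Figure 57; it only compares a left-deep chain of additions with a balanced tree over the same operands, and all function names are hypothetical.

```c
/* Depth of a left-deep chain of additions over n leaves: ((x0+x1)+x2)+... */
int chain_depth(int n) { return n > 1 ? n - 1 : 0; }

/* Depth of a balanced addition tree over n leaves, i.e. ceil(log2(n)). */
int balanced_depth(int n)
{
    int depth = 0;
    while (n > 1) { n = (n + 1) / 2; depth++; }
    return depth;
}

/* Sum evaluated sequentially, one addition after another. */
int chain_sum(const int *x, int n)
{
    int acc = 0;
    for (int i = 0; i < n; i++)
        acc += x[i];
    return acc;
}

/* Sum evaluated as a balanced tree over x[lo..hi), requires hi > lo. */
int balanced_sum(const int *x, int lo, int hi)
{
    if (hi - lo == 1)
        return x[lo];
    int mid = lo + (hi - lo) / 2;
    return balanced_sum(x, lo, mid) + balanced_sum(x, mid, hi);
}
```

Because integer addition is commutative and associative, both evaluation orders give the same result, while the balanced form needs fewer pipeline stages.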



5.1.4 XPP Code generation 

Pipeline Synthesis 

As already stated, the pipeline is synthesized by a dynamic programming tree matcher. In contrast to
sequential processors it does not generate instructions and register references but PAE opcodes and
port connections. The main calculation network is shown in Figure 58. The input data preparation
network is not shown in this figure. The case of synthesized shift registers is shown in Figure 59,
while the variant with duplicated input data simply consists of an IRAM for each input channel in
Figure 58.

Although this is straightforward, there remains the question of how to access the different offsets of
the vector register accesses. Although the RAM-PAEs are dual ported, it is obvious that it is not
possible to read different addresses concurrently.

Since it is not efficient to synthesize a configuration which generates the different addresses
sequentially and demultiplexes the read operands into different branches of the data flow, other
arrangements have to be made.

The two possibilities to access input data presented in subsection 4.4.5 yield the following in RISC
pseudo code and XPP utilization. The pseudo code running on the RISC core looks like

XPPPreload(config)
for(v=0; v<=16-3; v++) {
    XPPPreload(0, &p1[v], 16)
    XPPPreload(1, &p1[v+1], 16)
    XPPPreload(2, &p1[v+2], 16)
    XPPPreloadClean(3, &p2[v+1], 16)
    XPPExecute(config, IRAM(0), IRAM(1), IRAM(2), IRAM(3))
}

for shift register synthesis and like

XPPPreload(config)
for(v=0; v<=16-3; v++) {
    XPPPreload(0, &p1[v], 16)
    XPPPreload(1, &p1[v], 16)
    XPPPreload(2, &p1[v], 16)
    XPPPreload(3, &p1[v+1], 16)
    XPPPreload(4, &p1[v+1], 16)
    XPPPreload(5, &p1[v+2], 16)
    XPPPreload(6, &p1[v+2], 16)
    XPPPreload(7, &p1[v+2], 16)
    XPPPreloadClean(3, &p2[v+1], 16)
    XPPExecute(config, IRAM(0), IRAM(1), IRAM(2), IRAM(3),
                       IRAM(4), IRAM(5), IRAM(6), IRAM(7))
}

for data duplication, respectively.

Figure 58: The main calculation network of the edge3x3 configuration. The MULT-SORT combination
does the abs() calculation while the SORT does the min() calculation.

Figure 59: One input after shift register synthesis. The leftmost input contains p1[][h], the middle one
p1[][h+1] and the rightmost p1[][h+2], respectively.

The values for place & route and simulation are compared in the following table. Note that a common
RISC DSP with two MAC units and hardware loop support needs about 4000 cycles for the same code.
This comparison does not account for cache misses. Furthermore it is obvious that the number of input
values is very small in this example and the DSP calculation time is proportional to that number. The
XPP performance on the other hand will improve with the number of input values. Therefore the XPP
performance will be more impressive with bigger image sizes.



Parameter                          Value (shift register synthesis)         Value (data duplication)
Vector length                      16                                       16
Reused data set size               256                                      256
I/O IRAMs                          3 I + 1 O = 4                            8 I + 1 O = 9
ALU                                27                                       21
BREG                               21 (1 defined + 20 route)                10 (1 defined + 9 route)
FREG                               22 (9 defined + 23 route)                19 (3 defined + 16 route)
Data flow graph width              14                                       14
Data flow graph height             3 (shift registers) + 8 (calculation)    8 (calculation)
Configuration cycles (simulated)   configuration            2262            configuration          2145
                                   preloads*   14*3*4        168            preloads    8*8*4       256
                                   cycles      14*57         795            cycles      14*52       728
                                   sum                      3225            sum                    3129

* assuming 4 words/cycle burst transfer

5.1.5 Enhancing Parallelism

After the synthesis the configuration calculating the inner loop utilizes 27 ALUs and 4 IRAMs for shift
register synthesis and 21 ALUs and 9 IRAMs for data duplication, respectively. Assuming an XPP64
core this leaves plenty of room for further optimizations. Nevertheless, since all optimizations
enhancing parallelism are performed before the synthesis takes place, it is crucial that they estimate
the needed resources and the benefit of the transformation very carefully. Furthermore they have to
account for both input preparation strategies to estimate correct values.

Loop Unrolling

Fully unrolling the inner loop would not lead to satisfying results, because the number of inputs and
outputs increases dramatically. That means data duplication would not be applicable and shift register
synthesis would exhaust most of the benefits of the parallelism by producing a very long pipeline for
each data flow graph. Although partial unrolling of the inner loop would be applicable, it promises not
much benefit for the area penalty introduced.

Unrolling the outer loop is also not applicable since it produces a further configuration. Nevertheless
a related transformation could do a good job on this loop nest.



Unroll-and-Jam 

The unroll-and-jam algorithm enhances parallelism and also improves IRAM usage. It brings pairs of
iterations together, ideally reusing IRAM outputs and calculation results. The algorithm partially
unrolls the outer loop and fuses the resulting inner loops. Before the unroll-and-jam is performed, the
so-called unroll-and-jam factor must be determined, which denominates the unrolling factor of the
outer loop. This is mainly influenced by the number of ALUs n (= 64 assuming XPP64) and calculates
to

    c_unroll-and-jam = n_ALUs / n_ALUs,inner loop = 64 / 27 = 2 (integer division).

Thus the source code would be transformed to:

for(v=0; v<=VERLEN-3; v+=2) {
    for(h=0; h<=HORLEN-3; h++) {
        p2[v+1][h+1] = min( abs( (p1[v+2][h]   - p1[v][h]) +
                                 (p1[v+2][h+2] - p1[v][h+2]) +
                                 2 * (p1[v+2][h+1] - p1[v][h+1])) +
                            abs( (p1[v][h+2]   - p1[v][h]) +
                                 (p1[v+2][h+2] - p1[v+2][h]) +
                                 2 * (p1[v+1][h+2] - p1[v+1][h])), 255);
        p2[v+2][h+1] = min( abs( (p1[v+3][h]   - p1[v+1][h]) +
                                 (p1[v+3][h+2] - p1[v+1][h+2]) +
                                 2 * (p1[v+3][h+1] - p1[v+1][h+1])) +
                            abs( (p1[v+1][h+2] - p1[v+1][h]) +
                                 (p1[v+3][h+2] - p1[v+3][h]) +
                                 2 * (p1[v+2][h+2] - p1[v+2][h])), 255);
    }
}
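
A minimal sketch can confirm that unroll-and-jam with factor 2 leaves the result unchanged when the outer loop count is even (14 here). The helper edge_pixel and the two function names are hypothetical; they only restate the single-statement form of the inner loop body.

```c
#include <stdlib.h>   /* abs */

#define VERLEN 16
#define HORLEN 16

/* One output pixel of the 3x3 edge detector for the window starting at (v, h). */
static int edge_pixel(int p1[VERLEN][HORLEN], int v, int h)
{
    int htmp = (p1[v+2][h] - p1[v][h]) + (p1[v+2][h+2] - p1[v][h+2])
             + 2 * (p1[v+2][h+1] - p1[v][h+1]);
    int vtmp = (p1[v][h+2] - p1[v][h]) + (p1[v+2][h+2] - p1[v+2][h])
             + 2 * (p1[v+1][h+2] - p1[v+1][h]);
    int sum = abs(htmp) + abs(vtmp);
    return sum > 255 ? 255 : sum;
}

/* Original loop nest. */
void edge_plain(int p1[VERLEN][HORLEN], int p2[VERLEN][HORLEN])
{
    for (int v = 0; v <= VERLEN - 3; v++)
        for (int h = 0; h <= HORLEN - 3; h++)
            p2[v+1][h+1] = edge_pixel(p1, v, h);
}

/* Outer loop unrolled by 2, the two inner loops fused (jammed) into one. */
void edge_unroll_and_jam(int p1[VERLEN][HORLEN], int p2[VERLEN][HORLEN])
{
    for (int v = 0; v <= VERLEN - 3; v += 2)
        for (int h = 0; h <= HORLEN - 3; h++) {
            p2[v+1][h+1] = edge_pixel(p1, v, h);
            p2[v+2][h+1] = edge_pixel(p1, v + 1, h);
        }
}
```

In the jammed body the two statements share the rows p1[v+1] and p1[v+2], which is where the reuse of IRAM outputs comes from.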

The transformation introduces additional accesses to p1[v+3][h], p1[v+3][h+2], p1[v+3][h+1], and
p1[v+1][h+1] (the former hole in the access pattern) as well as a write access to p2[v+2][h+1]. That
means 2 IRAMs more for shift register synthesis (one input, one output) and 5 IRAMs more for data
duplication (4 input, 1 output), while performance is doubled.

Parameter                          Value (shift register         Value (data duplication -     Value (data duplication -
                                   synthesis)                    no IRAM placement)            with IRAM placement)
Vector length                      16                            16                            16
Reused data set size               256                           256                           256
I/O IRAMs                          4 I + 2 O = 6                 12 I + 2 O = 14               12 I + 2 O = 14
ALU                                45                            37                            37
BREG                               31 (12 defined + 19 route)    42 (4 defined + 38 route)     36 (4 defined + 32 route)
FREG                               29 (1 defined + 28 route)     18 (1 defined + 17 route)     24 (1 defined + 23 route)
Data flow graph width              14                            14                            14
Data flow graph height             3 (shift registers) +         8 (calculation)               3 (shift registers) +
                                   8 (calculation)                                             8 (calculation)
Configuration cycles (simulated)   configuration        2753     configuration        2754     configuration        2768
                                   preloads  7*4*4       112     preloads  7*12*4      336     preloads  7*12*4      336
                                   cycles    7*53        371     cycles    7*69        483     cycles    7*51        357
                                   sum                  3236     sum                  3573     sum                  3461
The simulated results are shown in the table above. Please note the differences between the two
columns labeled "data duplication". The first used xmap to place the IRAMs, while in the second the
IRAMs were placed by hand using a greedy algorithm which places IRAMs that are operands of the
same operator in one line (as long as this is possible). The second solution improved the iteration
cycles by 18 (from 69 to 51 per iteration). This shows that IRAM placement has a great impact on the
final performance.

The traditional unroll-and-jam algorithm uses loop peeling to split the outer loop into a preloop and an
unrollable main loop to handle odd loop counts. When we assume for instance n = 128, the
unroll-and-jam factor would calculate to

    c_unroll-and-jam = 128 / 27 = 4 (integer division).

Since the outer loop count (14) is not a multiple of 4, the algorithm virtually peels off the first two
iterations and fuses the two loops at the end, adding guards to the inner loop body. Then the code looks
like (guards emphasized)


for(v=0; v<=VERLEN-5; v+=4) {
    for(h=0; h<=HORLEN-3; h++) {
        if(v>0) p2[v+1][h+1] = min( abs( (p1[v+2][h]   - p1[v][h]) +
                                         (p1[v+2][h+2] - p1[v][h+2]) +
                                         2 * (p1[v+2][h+1] - p1[v][h+1])) +
                                    abs( (p1[v][h+2]   - p1[v][h]) +
                                         (p1[v+2][h+2] - p1[v+2][h]) +
                                         2 * (p1[v+1][h+2] - p1[v+1][h])), 255);
        if(v>1) p2[v+2][h+1] = min( abs( (p1[v+3][h]   - p1[v+1][h]) +
                                         (p1[v+3][h+2] - p1[v+1][h+2]) +
                                         2 * (p1[v+3][h+1] - p1[v+1][h+1])) +
                                    abs( (p1[v+1][h+2] - p1[v+1][h]) +
                                         (p1[v+3][h+2] - p1[v+3][h]) +
                                         2 * (p1[v+2][h+2] - p1[v+2][h])), 255);
                p2[v+3][h+1] = min( abs( (p1[v+4][h]   - p1[v+2][h]) +
                                         (p1[v+4][h+2] - p1[v+2][h+2]) +
                                         2 * (p1[v+4][h+1] - p1[v+2][h+1])) +
                                    abs( (p1[v+2][h+2] - p1[v+2][h]) +
                                         (p1[v+4][h+2] - p1[v+4][h]) +
                                         2 * (p1[v+3][h+2] - p1[v+3][h])), 255);
                p2[v+4][h+1] = min( abs( (p1[v+5][h]   - p1[v+3][h]) +
                                         (p1[v+5][h+2] - p1[v+3][h+2]) +
                                         2 * (p1[v+5][h+1] - p1[v+3][h+1])) +
                                    abs( (p1[v+3][h+2] - p1[v+3][h]) +
                                         (p1[v+5][h+2] - p1[v+5][h]) +
                                         2 * (p1[v+4][h+2] - p1[v+4][h])), 255);
    }
}



5.1.6 Parameterized Function

Source code

The benchmark source code is not very likely to be written in that form in real world applications.
Normally it would be encapsulated in a function with parameters for the input and output arrays along
with the sizes of the picture to work on.

Therefore the source code would look similar to:

void edge3x3(int *p1, int *p2, int HORLEN, int VERLEN)
{
    int v, h, htmp, vtmp, sum;

    for(v=0; v<=VERLEN-3; v++) {
        for(h=0; h<=HORLEN-3; h++) {
            htmp = (*(p1 + (v+2) * HORLEN + h)   - *(p1 + v * HORLEN + h)) +
                   (*(p1 + (v+2) * HORLEN + h+2) - *(p1 + v * HORLEN + h+2)) +
                   2 * (*(p1 + (v+2) * HORLEN + h+1) - *(p1 + v * HORLEN + h+1));
            if (htmp < 0)
                htmp = - htmp;

            vtmp = (*(p1 + v * HORLEN + h+2)     - *(p1 + v * HORLEN + h)) +
                   (*(p1 + (v+2) * HORLEN + h+2) - *(p1 + (v+2) * HORLEN + h)) +
                   2 * (*(p1 + (v+1) * HORLEN + h+2) - *(p1 + (v+1) * HORLEN + h));
            if (vtmp < 0)
                vtmp = - vtmp;

            sum = htmp + vtmp;
            if (sum > 255)
                sum = 255;
            *(p2 + (v+1) * HORLEN + h+1) = sum;
        }
    }
}

This requires some additional features from the compiler.

■ interprocedural optimizations and analysis

■ hints by the programmer (e.g. a compiler known assert(VERLEN % 2 == 0) makes unroll-and-jam
actually possible without peeling off iterations and running them conditionally)

Fitting the Algorithm Optimally to the Array

Since HORLEN and VERLEN are not known at compile time, these unknown parameters introduce
some constraints which prevent pipeline vectorization. The compiler must assume that the IRAMs
cannot hold all HORLEN input values in a row, so pipeline vectorization would not be possible.

Strip Mining Inner Loop

Strip mining partitions the inner loop into a loop that runs over a strip, which is chosen to be of the
same size as the IRAMs can hold, and a by-strip loop iterating over the strips. Of course the strip
loop's upper bound must be adjusted for the possibly incomplete last strip. After the strip mining the
original code would look like (outer v-loop neglected):

for(h=0; h <= HORLEN-3; h+= stripsize) {
    for(hh=h; hh<=min(h+stripsize-1, HORLEN-3); hh++) {
        htmp = (*(p1 + (v+2) * HORLEN + hh) - *(p1 + v * HORLEN + hh)) +
        ...
    }
}
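
The strip-mined loop structure can be tried out on a simpler loop body. The following is a minimal sketch with hypothetical function names; the doubling of each element only stands in for the real loop body, and min_int replaces the compiler known min().

```c
static int min_int(int a, int b) { return a < b ? a : b; }

/* Plain loop: visits h = 0 .. n-1 once each. */
void scale_plain(const int *in, int *out, int n)
{
    for (int h = 0; h < n; h++)
        out[h] = 2 * in[h];
}

/* Strip-mined loop: the by-strip loop steps over strips of `stripsize`
   elements, the strip loop walks one strip and is clipped at n-1 so that
   an incomplete last strip is handled correctly. */
void scale_strip_mined(const int *in, int *out, int n, int stripsize)
{
    for (int h = 0; h < n; h += stripsize)
        for (int hh = h; hh <= min_int(h + stripsize - 1, n - 1); hh++)
            out[hh] = 2 * in[hh];
}

/* Number of strips, counting an incomplete last strip as one. */
int strip_count(int n, int stripsize)
{
    return (n + stripsize - 1) / stripsize;
}
```

With n = 10 and stripsize = 4 the strips cover the index ranges 0..3, 4..7 and 8..9; only the inner strip loop would be mapped to the array, with one set of preloads per strip.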

Assuming an IRAM size (strip size) of 256, the following simulated results can be obtained for one
strip. The values must be multiplied by the number of strips to be calculated.

Parameter                          Value (shift register synthesis)         Value (data duplication -
                                                                            with IRAM placement)
Vector length                      16                                       16
Reused data set size               256                                      256
I/O IRAMs                          4 I + 2 O = 6                            12 I + 2 O = 14
ALU                                45                                       37
BREG                               31 (12 defined + 19 route)               42 (4 defined + 38 route)
FREG                               29 (1 defined + 28 route)                18 (1 defined + 17 route)
Data flow graph width              14                                       14
Data flow graph height             3 (shift registers) + 8 (calculation)    8 (calculation)
Configuration cycles (simulated)   configuration           2753             configuration           2754
                                   preloads  7*4*64        1792             preloads  7*12*64       5376
                                   cycles    128*530      67840             cycles    128*553      70784
                                   sum                    72385             sum                    78914



The RISC DSP needs about 1.47 million cycles for this amount of data. As mentioned above, these
values do not include cache miss penalties and therefore underestimate the real values. Furthermore it
can be seen that data duplication does not improve the performance. The reason for this seems to be a
worse placement and routing.

5.2 FIR Filter



5.2.1 Original Code

Source code:

#define N 256
#define M 8

for (i = 0; i < N-M+1; i++) {
S:  y[i] = 0;
    for (j = 0; j < M; j++)
S':     y[i] += c[j] * x[i+M-j-1];
}

The constants N and M are replaced by their values by the pre-processor. The data dependence graph
is the following:

for (i = 0; i < 269; i++) {
S:  y[i] = 0;
    for (j = 0; j < 8; j++)
S':     y[i] += c[j] * x[i+7-j];
}

We have the following table:

Parameter


Value


Vector length




Reused data set size 




I/O IRAMs 




ALU 










O 


Data flow graph width 


L 


Data flow graph height 




Configuration cycles 


2+8=10 



5.2.2 First Solution

In the case we want to save memory, the straightforward solution is to unroll the inner loop and to use
shift register synthesis to delay the values of array x in the pipeline. No other optimization is applied
before, as either they do not have an effect on the loop or they increase the need for IRAMs. After loop
unrolling, we obtain the following code:

for (i = 0; i < 269; i++) {
    y[i] = 0;
    y[i] += c[0] * x[i+7];
    y[i] += c[1] * x[i+6];
    y[i] += c[2] * x[i+5];
    y[i] += c[3] * x[i+4];
    y[i] += c[4] * x[i+3];
    y[i] += c[5] * x[i+2];
    y[i] += c[6] * x[i+1];
    y[i] += c[7] * x[i];
}

Then the table looks like this:



Parameter                   Value
Vector length
Reused data set size
I/O IRAMs
ALU
BREG                        0
FREG                        0
Data flow graph width
Data flow graph height
Configuration cycles        9+269=278

Dataflow analysis reveals that y[0]=f(x[0],...,x[7]), y[1]=f(x[1],...,x[8]), ..., y[i]=f(x[i],...,x[i+7]).

Successive values of y depend on almost the same successive values of x. To prevent unnecessary
accesses to the IRAMs, the values of x needed for the computation of the next values of y are kept in
registers. In our case this shift register synthesis needs 7 registers. This will be achieved on the PACT
XPP by keeping them in FREGs. Then we obtain the dataflow graph depicted below. An IRAM is
used for the input values and an IRAM for the output values. The first 8 cycles are used to fill the
pipeline and then the throughput is one output value/cycle. We can depict the code as the following:

r0 = x[0];
r1 = x[1];
r2 = x[2];
r3 = x[3];
r4 = x[4];
r5 = x[5];
r6 = x[6];
r7 = x[7];

for (i = 0; i < 269; i++) {
    y[i] = c7*r0 + c6*r1 + c5*r2 + c4*r3 + c3*r4 + c2*r5 + c1*r6 + c0*r7;
    r0 = r1;
    r1 = r2;
    r2 = r3;
    r3 = r4;
    r4 = r5;
    r5 = r6;
    r6 = r7;
    r7 = x[i+7];
}
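
The shift register idea can be compared against the direct form in a small sketch. The function names are hypothetical, and the sketch makes one assumption that differs from the listing above: the register that is refilled after the shift reads x[i+8], so that r[k] holds x[i+k] at the start of every iteration, and the refill is guarded at the end of the input.

```c
#define TAPS 8

/* Direct form: y[i] = sum over j of c[j] * x[i + TAPS - 1 - j]. */
void fir_direct(const int *x, const int *c, int *y, int n_out)
{
    for (int i = 0; i < n_out; i++) {
        y[i] = 0;
        for (int j = 0; j < TAPS; j++)
            y[i] += c[j] * x[i + TAPS - 1 - j];
    }
}

/* Same filter with the current window of x kept in a software shift
   register r[], so that every x value is loaded from memory only once.
   x must hold n_out + TAPS - 1 elements. */
void fir_shift_register(const int *x, const int *c, int *y, int n_out)
{
    int r[TAPS];
    for (int k = 0; k < TAPS; k++)          /* fill the pipeline */
        r[k] = x[k];
    for (int i = 0; i < n_out; i++) {
        int acc = 0;
        for (int k = 0; k < TAPS; k++)      /* here r[k] == x[i+k] */
            acc += c[TAPS - 1 - k] * r[k];
        y[i] = acc;
        for (int k = 0; k < TAPS - 1; k++)  /* shift by one position */
            r[k] = r[k + 1];
        if (i + 1 < n_out)                  /* do not read past the input */
            r[TAPS - 1] = x[i + TAPS];
    }
}
```

Only one new input value is fetched per output value; on the array this single fetch corresponds to the one input IRAM feeding the register chain.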



The final table is shown below, and the expected speedup with respect to a standard superscalar
processor with 2 instructions issued per cycle is 13.6.

Parameter                   Value
Vector length               269
Reused data set size
I/O IRAMs                   2
ALU                         16
BREG                        0
FREG                        7
Data flow graph width       3
Data flow graph height      9
Configuration cycles        8+269=277

Ops                                   Number
LD/ST (2 cycles)                      2
ADDRCOMP (1 cycle)                    0
ADD/SUB (1 cycle)                     8
MUL (2 cycles)                        8
SHIFT (1 cycle)                       0
Cycles per iteration                  28
Cycles needed for the loop (2-way)    (28*269)/2=3766
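
The arithmetic behind the cycle estimate and the quoted speedup can be restated as a short sketch. The struct and function names are hypothetical; the numbers in the usage note are the ones from the tables above.

```c
/* One row of the operation table: number of operations and latency of each. */
struct op_count {
    int number;
    int cycles;
};

/* Cycles per iteration: sum over all rows of number * latency. */
int cycles_per_iteration(const struct op_count *ops, int n_ops)
{
    int total = 0;
    for (int k = 0; k < n_ops; k++)
        total += ops[k].number * ops[k].cycles;
    return total;
}

/* Loop cycles on a processor issuing `issue_width` instructions per cycle. */
int loop_cycles(int cycles_per_iter, int iterations, int issue_width)
{
    return cycles_per_iter * iterations / issue_width;
}

/* Speedup of the XPP configuration over the reference processor. */
double speedup(int reference_cycles, int xpp_cycles)
{
    return (double)reference_cycles / (double)xpp_cycles;
}
```

With the rows {2,2}, {0,1}, {8,1}, {8,2}, {0,1} this gives 28 cycles per iteration, 3766 cycles for 269 iterations on a 2-way machine, and 3766 / 277, about 13.6, as speedup.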



Variant with Larger Loop Bounds 

Let us take larger loop bounds and set the values of N and M to 1024 and 64.

for (i = 0; i < 961; i++) {
    y[i] = 0;
    for (j = 0; j < 64; j++)
        y[i] += c[j] * x[i+63-j];
}

Following the loop optimizations driver given before, we apply loop tiling to reduce the iteration range
of the inner loop. We obtain the following loop nest:

for (i = 0; i < 961; i++) {
    y[i] = 0;
    for (jj = 0; jj < 8; jj++)
        for (j = 0; j < 8; j++)
            y[i] += c[8*jj+j] * x[i+63-8*jj-j];
}
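
A minimal sketch shows that tiling the 64-iteration inner loop into 8 tiles of 8 iterations only reorders the index computation and leaves each output value unchanged. The function names and the TILE macro are hypothetical.

```c
#define M    64   /* number of filter taps */
#define TILE 8    /* iterations per tile   */

/* One output value, direct inner loop over all M taps. */
int fir_tap_sum(const int *x, const int *c, int i)
{
    int y = 0;
    for (int j = 0; j < M; j++)
        y += c[j] * x[i + M - 1 - j];
    return y;
}

/* Same output value with the inner loop tiled: jj selects the tile,
   j the position inside the tile, and TILE*jj + j replaces the old j. */
int fir_tap_sum_tiled(const int *x, const int *c, int i)
{
    int y = 0;
    for (int jj = 0; jj < M / TILE; jj++)
        for (int j = 0; j < TILE; j++)
            y += c[TILE * jj + j] * x[i + M - 1 - TILE * jj - j];
    return y;
}
```

After tiling, the innermost loop has a fixed length of 8, which is the part that is subsequently unrolled.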

A subsequent application of loop unrolling on the inner loop yields:

for (i = 0; i < 961; i++) {
    y[i] = 0;
    for (jj = 0; jj < 8; jj++) {
        y[i] += c[8*jj]   * x[i+63-8*jj];
        y[i] += c[8*jj+1] * x[i+62-8*jj];
        y[i] += c[8*jj+2] * x[i+61-8*jj];
        y[i] += c[8*jj+3] * x[i+60-8*jj];
        y[i] += c[8*jj+4] * x[i+59-8*jj];
        y[i] += c[8*jj+5] * x[i+58-8*jj];
        y[i] += c[8*jj+6] * x[i+57-8*jj];
        y[i] += c[8*jj+7] * x[i+56-8*jj];
    }
}

Finally we obtain the same dataflow graph as above, except that the coefficients must be read from
another IRAM rather than being directly handled like constants by the multiplications. After shift
register synthesis the code is the following:

for (i = 0; i < 961; i++) {
    r0 = x[i+56];
    r1 = x[i+57];
    r2 = x[i+58];
    r3 = x[i+59];
    r4 = x[i+60];
    r5 = x[i+61];
    r6 = x[i+62];
    r7 = x[i+63];
    for (jj = 0; jj < 8; jj++) {
        y[i] = c[8*jj]*r0 + c[8*jj+1]*r1 + c[8*jj+2]*r2 + c[8*jj+3]*r3 +
               c[8*jj+4]*r4 + c[8*jj+5]*r5 + c[8*jj+6]*r6 + c[8*jj+7]*r7;
        r0 = r1;
        r1 = r2;
        r2 = r3;
        r3 = r4;
        r4 = r5;
        r5 = r6;
        r6 = r7;
        r7 = x[i+63-8*jj];
    }
}

The table is the same as before except for the vector length, and the expected speedup with respect to
a standard superscalar processor with 2 instructions issued per cycle is 17.5.

Parameter                   Value
Vector length               8
Reused data set size
I/O IRAMs                   2
ALU                         16
BREG                        0
FREG                        7
Data flow graph width       3
Data flow graph height      9
Configuration cycles        8+8=16

Ops                                   Number
LD/ST (2 cycles)                      10
ADDRCOMP (1 cycle)                    0
ADD/SUB (1 cycle)                     16
MUL (2 cycles)                        17
SHIFT (1 cycle)                       0
Cycles per iteration                  70
Cycles needed for the loop (2-way)    (70*8)/2=280



5.2.3 A More Parallel Solution 

The solution we presented does not expose a lot of parallelism in the loop. We can try to explicitly
parallelize the loop before we generate the dataflow graph. Of course exposing more parallelism
means more pressure on the memory hierarchy.

In the data dependence graph presented at the beginning, the only loop-carried dependence is the
dependence on S' and it is only caused by the reference to y[i]. Hence we apply node splitting to get a
more suitable data dependence graph. We obtain then:

for (i = 0; i < 249; i++) {
    y[i] = 0;
    for (j = 0; j < 8; j++)
    {
        tmp = c[j] * x[i+7-j];
        y[i] += tmp;
    }
}

Then scalar expansion is performed on tmp to remove the anti loop-carried dependence caused by it,
and we have the following code:

for (i = 0; i < 249; i++) {
    y[i] = 0;
    for (j = 0; j < 8; j++)
    {
        tmp[j] = c[j] * x[i+7-j];
        y[i] += tmp[j];
    }
}

The parameter table is the following:

Parameter                   Value
Vector length               249
Reused data set size        -
I/O IRAMs                   3
ALU                         2
BREG                        0
FREG                        1
Data flow graph width       2
Data flow graph height      2
Configuration cycles        2+8=10

Then we apply loop distribution to get a vectorizable and a not vectorizable loop.

for (i = 0; i < 249; i++) {
    y[i] = 0;
    for (j = 0; j < 8; j++)
        tmp[j] = c[j] * x[i+7-j];
    for (j = 0; j < 8; j++)
        y[i] += tmp[j];
}

The parameter table given below corresponds to the two inner loops in order to be compared with the
preceding table.

Parameter                   Value
Vector length               249
Reused data set size
I/O IRAMs                   5
ALU                         2
BREG                        0
FREG                        1
Data flow graph width       1
Data flow graph height      3
Configuration cycles        1*8+1*8=16

Then we must take into account the architecture. The first loop is fully parallel; this means that we
would need 2*8=16 input values at a time. This is all right, as it corresponds to the number of IRAMs
of the PACT XPP. Hence we do not need to strip-mine the first inner loop. The case of the second
loop is trivial; it does not need to be strip-mined either. The second loop is a reduction: it computes the
sum of a vector. This is easily found by the reduction recognition optimization, and we obtain the
following code.



WO 2004/015561    PCT/EP2003/008080

for (i = 0; i < 249; i++) {
    y[i] = 0;
    for (j = 0; j < 8; j++)
        tmp[j] = c[j] * x[i+7-j];

    /* load the partial sums from memory using a shorter vector length */
    for (j = 0; j < 4; j++)
        aux[j] = tmp[2*j] + tmp[2*j+1];

    /* accumulate the short vector */
    for (j = 0; j < 2; j++)
        aux[2*j] = aux[2*j] + aux[2*j+1];

    /* sequence of scalar instructions to add up the partial sums */
    y[i] = aux[0] + aux[2];
}
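
The pairwise reduction can be checked against a sequential sum with a minimal sketch. The function names are hypothetical, and xw stands for the window of x values already paired with the coefficients. In this sketch the second stage runs over both pairs (j < 2), so that all four partial sums in aux[] are combined before the final addition.

```c
/* Sequential reduction: 8 multiply-accumulate steps in a row. */
int dot8_sequential(const int *c, const int *xw)
{
    int y = 0;
    for (int j = 0; j < 8; j++)
        y += c[j] * xw[j];
    return y;
}

/* Pairwise (tree) reduction: 8 products, then 4 + 2 + 1 additions.
   All operations within one stage are independent of each other. */
int dot8_tree(const int *c, const int *xw)
{
    int tmp[8], aux[4];
    for (int j = 0; j < 8; j++)             /* stage 1: products            */
        tmp[j] = c[j] * xw[j];
    for (int j = 0; j < 4; j++)             /* stage 2: 8 -> 4 partial sums */
        aux[j] = tmp[2*j] + tmp[2*j+1];
    for (int j = 0; j < 2; j++)             /* stage 3: 4 -> 2 partial sums */
        aux[2*j] = aux[2*j] + aux[2*j+1];
    return aux[0] + aux[2];                 /* stage 4: final sum           */
}
```

The four stages correspond to the four layers of the dataflow graph, so the reduction needs 8 + 4 + 2 + 1 = 15 operations arranged in a pipeline of depth 4.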



Like above we give only one table for all innermost loops and the last instruction computing y[i].

Parameter                   Value
Vector length               249
Reused data set size
I/O IRAMs                   12
ALU                         4
BREG                        0
FREG                        0
Data flow graph width       1
Data flow graph height      4
Configuration cycles        1*8+1*4+1*1=13



Finally loop unrolling is applied on the inner loops; the number of operations is always less than the
number of processing elements of the PACT XPP.

for (i = 0; i < 961; i++)
{
    tmp[0] = c[0] * x[i+7];
    tmp[1] = c[1] * x[i+6];
    tmp[2] = c[2] * x[i+5];
    tmp[3] = c[3] * x[i+4];
    tmp[4] = c[4] * x[i+3];
    tmp[5] = c[5] * x[i+2];
    tmp[6] = c[6] * x[i+1];
    tmp[7] = c[7] * x[i];

    aux[0] = tmp[0] + tmp[1];
    aux[1] = tmp[2] + tmp[3];
    aux[2] = tmp[4] + tmp[5];
    aux[3] = tmp[6] + tmp[7];

    aux[0] = aux[0] + aux[1];
    aux[2] = aux[2] + aux[3];

    y[i] = aux[0] + aux[2];
}

We obtain then the following dataflow graph representing the inner loop.

[Figure: dataflow graph of the inner loop, with inputs x[i+7] ... x[i] and output y[i]]

It could be mapped on the PACT XPP with each layer executed in parallel, thus needing 4
cycles/iteration and 15 ALU-PAEs, 8 of which are needed in parallel. As the graph is already
synchronized, the throughput reaches one iteration/cycle after 4 cycles to fill the pipeline. The
coefficients are taken as constant inputs by the ALUs performing the multiplications.

The drawback of this solution is that it uses 16 IRAMs, and that the input data must be stored in a
special order. The number of needed IRAMs can be reduced if the coefficients are handled like
constants for each ALU. But due to the data locality of the program, we can assume that the data
already reside in the cache. And as the transfer of data from the cache to the IRAMs can be achieved
efficiently, the configuration can be executed on the PACT XPP without waiting for the data to be
ready in the IRAMs. The parameter table is then the following:



Parameter                   Value
Vector length               249
Reused data set size
I/O IRAMs                   16
ALU                         15
BREG                        0
FREG                        0
Data flow graph width       8
Data flow graph height      4
Configuration cycles        4+961



Variant with Larger Bounds 

To make things a bit more interesting, set the values of N and M to 1024 and 64.

for (i = 0; i < 961; i++) {
    y[i] = 0;
    for (j = 0; j < 64; j++)
        y[i] += c[j] * x[i+63-j];
}

The data dependence graph is the same as above. We then apply node splitting to get a more
convenient data dependence graph.

for (i = 0; i < 961; i++) {
    y[i] = 0;
    for (j = 0; j < 64; j++)
    {
        tmp = c[j] * x[i+63-j];
        y[i] += tmp;
    }
}

After scalar expansion:

for (i = 0; i < 961; i++) {
    y[i] = 0;
    for (j = 0; j < 64; j++)
    {
        tmp[j] = c[j] * x[i+63-j];
        y[i] += tmp[j];
    }
}

After loop distribution:

for (i = 0; i < 961; i++) {
    y[i] = 0;
    for (j = 0; j < 64; j++)
        tmp[j] = c[j] * x[i+63-j];
    for (j = 0; j < 64; j++)
        y[i] += tmp[j];
}

We go through the compiling process, and we arrive to the set of optimizations that depends upon 
architectural parameters. We want to split the iteration space, as too many operations would hav& to be 
performed m parallel, if we keep it as such. Hence we perform strip-mining on the 2 loops. We can 
only access 16 data at a time, so, because of the first loop, the factor/will be 64'*' 2/16 = 8 for the 2 
loops (as we always have in mind that we want to execute both at the same time on the PACT XPP). 

for (i = 0; jL < 961; { 
y[il =0; 

for (jj 0; jj < 8; 
for (j=0;j < 8; j++) 

tmpr8*jj+j] = c[8*jj+j] * x[i+63-8*jj-j]; 
for (jj 0; jj < 8 ; jj+4-) 
for (j=«0;j <.8; j++) 

y[i] += tmp[8*jj+j] ; 



And then loop fusion on the jj loops is performed.

    for (i = 0; i < 961; i++) {
        y[i] = 0;
        for (jj = 0; jj < 8; jj++) {
            for (j = 0; j < 8; j++)
                tmp[8*jj+j] = c[8*jj+j] * x[i+63-8*jj-j];
            for (j = 0; j < 8; j++)
                y[i] += tmp[8*jj+j];
        }
    }
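The strip-mined and fused loop nest can be checked against the original FIR loop with a small test harness. The following is only a sketch under stated assumptions: the function names, the `int` element type and the test signal are hypothetical and do not come from the compiler described here; only the loop bounds (961 outputs, 64 taps) and the strip factor 64 * 2/16 = 8 are taken from the text.

```c
#include <string.h>

#define N_TAPS 64                 /* M in the text */
#define N_OUT  961                /* N - M + 1 with N = 1024 */
#define STRIP  (64 * 2 / 16)      /* strip-mining factor from the text: 8 */

/* Original FIR loop nest. */
void fir_reference(int *y, const int *c, const int *x)
{
    for (int i = 0; i < N_OUT; i++) {
        y[i] = 0;
        for (int j = 0; j < N_TAPS; j++)
            y[i] += c[j] * x[i + 63 - j];
    }
}

/* Loop nest after scalar expansion, strip-mining and fusion of the jj loops. */
void fir_strip_mined(int *y, const int *c, const int *x)
{
    int tmp[N_TAPS];
    for (int i = 0; i < N_OUT; i++) {
        y[i] = 0;
        for (int jj = 0; jj < N_TAPS / STRIP; jj++) {
            for (int j = 0; j < STRIP; j++)
                tmp[STRIP * jj + j] = c[STRIP * jj + j] * x[i + 63 - STRIP * jj - j];
            for (int j = 0; j < STRIP; j++)
                y[i] += tmp[STRIP * jj + j];
        }
    }
}

/* Returns 1 if both forms produce identical output on a deterministic signal. */
int fir_forms_agree(void)
{
    static int x[N_OUT + N_TAPS - 1], c[N_TAPS], y0[N_OUT], y1[N_OUT];
    for (int k = 0; k < N_OUT + N_TAPS - 1; k++)
        x[k] = (k * 7) % 13 - 6;
    for (int k = 0; k < N_TAPS; k++)
        c[k] = (k % 5) - 2;
    fir_reference(y0, c, x);
    fir_strip_mined(y1, c, x);
    return memcmp(y0, y1, sizeof y0) == 0;
}
```

Calling `fir_forms_agree()` exercises all 961 outputs, so an indexing slip in the strip-mined subscripts would show up as a mismatch.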

Now we apply reduction recognition on the second innermost loop.

    for (i = 0; i < 961; i++) {
        tmp = 0;
        for (jj = 0; jj < 8; jj++)
        {
            for (j = 0; j < 8; j++)
                tmp[8*jj+j] = c[8*jj+j] * x[i+63-8*jj-j];
            /* load the partial sums from memory using a shorter vector length */
            for (j = 0; j < 4; j++)
                aux[j] = tmp[8*jj+2*j] + tmp[8*jj+2*j+1];
            /* accumulate the short vector */
            for (j = 0; j < 1; j++)
                aux[2*j] = aux[2*j] + aux[2*j+1];
            /* sequence of scalar instructions to add up the partial sums */
            y[i] = aux[0] + aux[2];
        }
    }

And then loop unrolling:

    for (i = 0; i < 961; i++)
        for (jj = 0; jj < 8; jj++)
        {
            tmp[8*jj]   = c[8*jj]   * x[i+63-8*jj];
            tmp[8*jj+1] = c[8*jj+1] * x[i+62-8*jj];
            tmp[8*jj+2] = c[8*jj+2] * x[i+61-8*jj];
            tmp[8*jj+3] = c[8*jj+3] * x[i+60-8*jj];
            tmp[8*jj+4] = c[8*jj+4] * x[i+59-8*jj];
            tmp[8*jj+5] = c[8*jj+5] * x[i+58-8*jj];
            tmp[8*jj+6] = c[8*jj+6] * x[i+57-8*jj];
            tmp[8*jj+7] = c[8*jj+7] * x[i+56-8*jj];
            aux[0] = tmp[8*jj]   + tmp[8*jj+1];
            aux[1] = tmp[8*jj+2] + tmp[8*jj+3];
            aux[2] = tmp[8*jj+4] + tmp[8*jj+5];
            aux[3] = tmp[8*jj+6] + tmp[8*jj+7];
            aux[0] = aux[0] + aux[1];
            aux[2] = aux[2] + aux[3];
            y[i] = aux[0] + aux[2];
        }
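The balanced adder tree produced by reduction recognition and unrolling can be illustrated in isolation. This is a minimal sketch, not the compiler's output: the function names are hypothetical, and `fir_output_tree` accumulates the eight strip sums into the result, which is an assumption made here so that the sketch computes a complete FIR output.

```c
/* Sum eight partial products with a balanced tree of depth 3,
   mirroring the aux[] statements of the unrolled loop body. */
int reduce8_tree(const int tmp[8])
{
    int aux[4];
    aux[0] = tmp[0] + tmp[1];
    aux[1] = tmp[2] + tmp[3];
    aux[2] = tmp[4] + tmp[5];
    aux[3] = tmp[6] + tmp[7];
    aux[0] = aux[0] + aux[1];
    aux[2] = aux[2] + aux[3];
    return aux[0] + aux[2];
}

/* Same sum as a sequential chain of depth 7, for comparison. */
int reduce8_chain(const int tmp[8])
{
    int sum = 0;
    for (int j = 0; j < 8; j++)
        sum += tmp[j];
    return sum;
}

/* One FIR output built from eight strips of eight products each.
   Assumption: the strip sums are accumulated with +=. */
int fir_output_tree(const int *c, const int *x, int i)
{
    int y = 0;
    for (int jj = 0; jj < 8; jj++) {
        int tmp[8];
        for (int j = 0; j < 8; j++)
            tmp[j] = c[8 * jj + j] * x[i + 63 - 8 * jj - j];
        y += reduce8_tree(tmp);
    }
    return y;
}
```

The tree needs the same seven additions as the chain, but its dependence height is three instead of seven, which is what matters for a pipelined dataflow mapping.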

We implement the innermost loop on the PACT XPP directly with a counter. The IRAMs are used in
FIFO mode, and filled according to the addresses of the arrays in the loop. IRAM0, IRAM2, IRAM4,
IRAM6 and IRAM8 contain array c. IRAM1, IRAM3, IRAM5 and IRAM7 contain array x. Array c
contains 64 elements, that is, each IRAM contains 8 elements. Array x contains 1024 elements, that is,
128 elements for each IRAM. Array y is directly written to memory, as it is a global array and its
address is constant. This constant is used to initialize the address counter of the configuration. The final
parameter table is the following:

Parameter                   Value
Vector length               8
Reused data set size
I/O IRAMs                   16
ALU                         15
BREG                        0
FREG                        0
Data flow graph width       8
Data flow graph height      4
Configuration cycles        4+8=12



Nevertheless it should be noted that this version should be less efficient than the previous one. As the
same data must be loaded into the different IRAMs from the cache, a lot of transfers have to be completed
before the configuration can begin the computations. This overhead must be taken into account by the
compiler when choosing the code generation strategy. This also means that the first solution is the
solution that will be chosen by the compiler.

5.2.4 Other Variant 

Source Code 

    for (i = 0; i < N-M+1; i++) {
        tmp = 0;
        for (j = 0; j < M; j++)
            tmp += c[j] * x[i+M-j-1];
        x[i] = tmp;
    }

In this case, it is trivial that the data dependence graph is cyclic due to dependences on tmp. Therefore
scalar expansion is applied on the loop, and we obtain in fact the same code as the first version of the
FIR filter as shown below.

    for (i = 0; i < N-M+1; i++) {
        tmp[i] = 0;
        for (j = 0; j < M; j++)
            tmp[i] += c[j] * x[i+M-j-1];
        x[i] = tmp[i];
    }

5.3 Matrix Multiplication

5.3.1 Original Code

Source code: 

    #include <stdio.h>

    #define L 10
    #define M 15
    #define N 20

    int A[L][M];
    int B[M][N];
    int R[L][N];

    int main() {
        int i, j, k, tmp, aux;

        /* input A (L*M values) */
        for (i = 0; i < L; i++)
            for (j = 0; j < M; j++)
                scanf("%d", &A[i][j]);

        /* input B (M*N values) */
        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++)
                scanf("%d", &B[i][j]);

        /* multiply */
        for (i = 0; i < L; i++)
            for (j = 0; j < N; j++) {
                aux = 0;
                for (k = 0; k < M; k++)
                    aux += A[i][k] * B[k][j];
                R[i][j] = aux;
            }

        /* write data stream */
        for (i = 0; i < L; i++)
            for (j = 0; j < N; j++)
                printf("%d\n", R[i][j]);

        return 0;
    }

5.3.2 Preliminary Transformations

Since no inline-able function calls are present, no interprocedural code movement is done.

Of the four loop nests the one with the "/* multiply */" comment is the only candidate for running
partly on the XPP. All others have function calls in the loop body and are therefore discarded as
candidates very early in the compiler.

Dependency Analysis 

        for (i = 0; i < L; i++)
            for (j = 0; j < N; j++) {
    S1          aux = 0;
                for (k = 0; k < M; k++)
    S2              aux += A[i][k] * B[k][j];
    S3          R[i][j] = aux;
            }

Figure 60 Data dependency graph for matrix multiplication

The data dependency graph shows no dependencies that prevent pipeline vectorization. The loop-carried
true dependence from S2 to itself can be handled by a feedback of aux as described in [1].

Reverse Loop-Invariant Code Motion 

To get a perfect loop nest we move S1 and S3 inside the k-loop. Therefore appropriate guards are
generated to protect the assignments. The code after this transformation looks like

    for (i = 0; i < L; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < M; k++) {
                if (k == 0) aux = 0;
                aux += A[i][k] * B[k][j];
                if (k == M-1) R[i][j] = aux;
            }

Scalar Expansion 

Our goal is to interchange the loop nests to improve the array accesses to utilize the cache best.
Unfortunately the guarded statements involving aux cause backward loop-carried anti-dependences
carried by the j loop. Scalar expansion will break these dependences, allowing loop interchange.

    for (i = 0; i < L; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < M; k++) {
                if (k == 0) aux[j] = 0;
                aux[j] += A[i][k] * B[k][j];
                if (k == M-1) R[i][j] = aux[j];
            }



Loop Interchange for Cache Reuse 

Visualizing the main loop shows the iteration spaces for the array accesses (Figure 61). Since C arrays
are placed in row major order the cache lines are placed in the array rows. At first sight there seems to be no
need for optimization because the algorithm requires at least one array access to stride over a column.
Nevertheless this assumption misses the fact that the access rate is of interest, too. Closer examination
shows that array R is accessed in every j iteration, while B is accessed in every k iteration, always
producing a cache miss*. This leaves a possibility for loop interchange to improve cache access as
proposed by Kennedy and Allen in [7].

* We neglect "aux" in this observation since we do not expect it to be written to or read from memory (no defs or
uses outside the loop nest).




Finding the best loop nest is relatively simple. The algorithm simply interchanges each loop of the
nests into the innermost position and annotates it with the so-called innermost memory cost term. This
cost term is a constant for known loop bounds or a function of the loop bound for unknown loop
bounds. The term is calculated in three steps.

• First the cost of each reference* in the innermost loop body is calculated to

  ■ 1, if the reference does not depend on the loop induction variable of the (current) innermost
    loop

  ■ the loop count, if the reference depends on the loop induction variable and strides over a
    non-contiguous area in respect of the cache layout

  ■ (N · s) / b, if the reference depends on the loop induction variable and strides over a contiguous
    dimension. In this case N is the loop count, s is the step size and b is the cache line size,
    respectively.

• Second each reference cost is weighted with a factor for each other loop, which is

  ■ 1, if the reference does not depend on the loop index

  ■ the loop count, if the reference depends on the loop index.

• Third the overall loop nest cost is calculated by summing the costs of all references.

After invoking this algorithm for each loop as the innermost, the one with the lowest cost is chosen as
the innermost, the next as the next outermost, and so on.
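The three steps above can be sketched in a few lines of C for the matrix multiplication nest. This is an illustration only: the data structure, the function names and the cache line size of b = 4 elements are assumptions made for the example, the step size s is taken as 1, and the loop counts are L = 10, N = 20, M = 15 as in the source code of this section.

```c
enum { LOOP_I, LOOP_J, LOOP_K, LOOP_COUNT };

/* One array reference: which loop indices it depends on, and which
   index walks the contiguous (last) array dimension. */
struct array_ref {
    int uses[LOOP_COUNT];
    int contiguous;
};

static const struct array_ref refs[3] = {
    { {1, 1, 0}, LOOP_J },   /* R[i][j] */
    { {1, 0, 1}, LOOP_K },   /* A[i][k] */
    { {0, 1, 1}, LOOP_J },   /* B[k][j] */
};

/* Loop counts: i runs to L = 10, j to N = 20, k to M = 15. */
static const double trip[LOOP_COUNT] = { 10.0, 20.0, 15.0 };

/* Steps 1 and 2: cost of one reference with loop `inner` innermost. */
static double reference_cost(const struct array_ref *r, int inner, double b)
{
    double cost;
    if (!r->uses[inner])
        cost = 1.0;                        /* invariant in the innermost loop */
    else if (r->contiguous == inner)
        cost = trip[inner] * 1.0 / b;      /* N * s / b with s = 1 */
    else
        cost = trip[inner];                /* strided, one miss per iteration */
    for (int l = 0; l < LOOP_COUNT; l++)
        if (l != inner && r->uses[l])
            cost *= trip[l];               /* weight by the other loops */
    return cost;
}

/* Step 3: sum over all references. */
double nest_cost(int inner, double b)
{
    double sum = 0.0;
    for (int r = 0; r < 3; r++)
        sum += reference_cost(&refs[r], inner, b);
    return sum;
}

/* Loop with the lowest innermost memory cost. */
int best_innermost(double b)
{
    int best = LOOP_I;
    for (int l = LOOP_J; l < LOOP_COUNT; l++)
        if (nest_cost(l, b) < nest_cost(best, b))
            best = l;
    return best;
}
```

With these assumed values the costs are 275 for j, 537.5 for k and 650 for i, which gives the loop order i, k, j from outermost to innermost.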



* Reference means access to an array in this case. Since the transformation wants to optimize cache access, it
must address references to the same array within small distances as one. This prohibits over-estimation of the
actual costs.

Innermost loop   R[i][j]      A[i][k]      B[k][j]      Memory access cost
k                1·L·N        (M/b)·L      M·N          L·N + (M/b)·L + M·N
i                L·N          L·M          1·M·N        L·N + L·M + M·N
j                (N/b)·L      L·M          (N/b)·M      (N/b)·(L+M) + L·M

Table 1 Loop memory access costs for the different loops being innermost



The table shows the values for the matrix multiplication. Since the j term is the smallest (of course
assuming b > 1), the j-loop is chosen to be the innermost. The next outer loop then is k, and the
outermost is i. Thus the resulting code after loop interchange is

    for (i = 0; i < L; i++)
        for (k = 0; k < M; k++)
            for (j = 0; j < N; j++) {
                if (k == 0) aux[j] = 0;
                aux[j] += A[i][k] * B[k][j];
                if (k == M-1) R[i][j] = aux[j];
            }
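That the interchange is legal once aux has been expanded can be confirmed by running both loop orders on the same input. The sketch below assumes `int` matrices with the dimensions of this section; the function names and the test data are hypothetical.

```c
#include <string.h>

enum { L = 10, M = 15, N = 20 };

/* Original i-j-k order with a scalar accumulator. */
void matmul_ijk(int R[L][N], int A[L][M], int B[M][N])
{
    for (int i = 0; i < L; i++)
        for (int j = 0; j < N; j++) {
            int aux = 0;
            for (int k = 0; k < M; k++)
                aux += A[i][k] * B[k][j];
            R[i][j] = aux;
        }
}

/* Interchanged i-k-j order: aux is expanded to one element per j,
   and the guards replace the statements hoisted out of the k loop. */
void matmul_ikj(int R[L][N], int A[L][M], int B[M][N])
{
    int aux[N];
    for (int i = 0; i < L; i++)
        for (int k = 0; k < M; k++)
            for (int j = 0; j < N; j++) {
                if (k == 0) aux[j] = 0;
                aux[j] += A[i][k] * B[k][j];   /* B is now read along a row */
                if (k == M - 1) R[i][j] = aux[j];
            }
}

/* Returns 1 if both loop orders give the same result matrix. */
int interchange_preserves_result(void)
{
    static int A[L][M], B[M][N], R0[L][N], R1[L][N];
    for (int i = 0; i < L; i++)
        for (int k = 0; k < M; k++)
            A[i][k] = (3 * i + k) % 7 - 3;
    for (int k = 0; k < M; k++)
        for (int j = 0; j < N; j++)
            B[k][j] = (k + 5 * j) % 11 - 5;
    matmul_ijk(R0, A, B);
    matmul_ikj(R1, A, B);
    return memcmp(R0, R1, sizeof R0) == 0;
}
```

The only functional change is the order in which the same products are accumulated; what improves is that the innermost loop now walks B and R along cache lines.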




Figure 62 The visualized array access sequences after optimization. Here the improvement is visible to the
naked eye, since array B is now read following the cache lines.

Figure 62 shows the improved iteration spaces. It should be noted that this optimization does not
primarily optimize for the XPP, but mainly optimizes the cache-hit rate, thus improving the overall
performance.

Unroll and Jam



After improving the cache access behavior, the possibility for reduction recognition has been
destroyed. This is a typical example for transformations where one excludes the other. Nevertheless
we obtain more parallelism by doing unroll-and-jam. Therefore we unroll the outer loop partially with the
unroll factor. This factor is mainly chosen by the minimum of two calculations:

■ # available IRAMs / # used IRAMs in the inner loop body

■ # available ALU resources / # used ALU resources in the inner loop

In this example the accesses to "A" and "B" depend on k (the loop which will be unrolled). Therefore
they must be considered in the calculation. The accesses to "aux" and "R" do not depend on k. Thus
they can be subtracted from the available IRAMs, but do not need to be added to the denominator.
Therefore we calculate (assuming an XPP64) 14/2 = 7 for the unroll factor obtained by the IRAM
resources.

On the other hand the loop body involves two ALU operations (1 add, 1 mult), which yields an
unrolling factor of approximately 64/2 = 32*. The constraint generated by the IRAMs therefore
dominates by far.
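The choice of the unroll factor is simple integer arithmetic, sketched below. The helper and its parameter names are hypothetical; the figures of 16 IRAMs and 64 ALUs for an XPP64 are inferred from the 14/2 and 64/2 quotients in the text rather than stated there explicitly.

```c
/* Unroll-and-jam factor as the minimum of the IRAM bound and the ALU bound.
   irams_fixed counts IRAMs whose use does not grow with unrolling
   (here: aux and R); irams_per_copy and alus_per_copy are the resources
   one copy of the loop body needs (here: A and B; one add and one mult). */
int unroll_factor(int total_irams, int irams_fixed, int irams_per_copy,
                  int total_alus, int alus_per_copy)
{
    int by_iram = (total_irams - irams_fixed) / irams_per_copy;
    int by_alu  = total_alus / alus_per_copy;
    return by_iram < by_alu ? by_iram : by_alu;
}
```

For the matrix multiplication, `unroll_factor(16, 2, 2, 64, 2)` takes the minimum of 14/2 = 7 and 64/2 = 32.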

Having chosen the unroll factor we must trim our loop trip count to be a multiple of that factor. Since
the k loop has a loop count of 15, we peel off the first iteration and unroll the remaining loop.

    for (i = 0; i < L; i++) {
        for (k = 0; k < 1; k++) {
            for (j = 0; j < N; j++) {
                if (k == 0) aux[j] = 0;
                aux[j] += A[i][k] * B[k][j];
                if (k == M-1) R[i][j] = aux[j];
            }
        }
        for (k = 1; k < M; k += 7) {
            for (j = 0; j < N; j++) {
                if (k == 0) aux[j] = 0;
                aux[j] += A[i][k] * B[k][j];
                if (k == M-1) R[i][j] = aux[j];
            }
            for (j = 0; j < N; j++) {
                if (k+1 == 0) aux[j] = 0;
                aux[j] += A[i][k+1] * B[k+1][j];
                if (k+1 == M-1) R[i][j] = aux[j];
            }
            for (j = 0; j < N; j++) {
                if (k+2 == 0) aux[j] = 0;
                aux[j] += A[i][k+2] * B[k+2][j];
                if (k+2 == M-1) R[i][j] = aux[j];
            }
            for (j = 0; j < N; j++) {
                if (k+3 == 0) aux[j] = 0;
                aux[j] += A[i][k+3] * B[k+3][j];
                if (k+3 == M-1) R[i][j] = aux[j];
            }
            for (j = 0; j < N; j++) {
                if (k+4 == 0) aux[j] = 0;
                aux[j] += A[i][k+4] * B[k+4][j];
                if (k+4 == M-1) R[i][j] = aux[j];
            }
            for (j = 0; j < N; j++) {
                if (k+5 == 0) aux[j] = 0;
                aux[j] += A[i][k+5] * B[k+5][j];
                if (k+5 == M-1) R[i][j] = aux[j];
            }
            for (j = 0; j < N; j++) {
                if (k+6 == 0) aux[j] = 0;
                aux[j] += A[i][k+6] * B[k+6][j];
                if (k+6 == M-1) R[i][j] = aux[j];
            }
        }
    }

* This is a very inaccurate estimation, since it neither estimates the resources spent by the controlling network,
which decreases the unroll factor, nor takes into account that e.g. the BREG-PAEs also have an adder, which
increases the unroll factor. Although it has no influence on this example, the unroll factor calculation of course
has to account for this in a production compiler.



Due to the fact that the reverse loop-invariant code motion placed the loop-invariant code into the inner
loop, which is now duplicated seven times, it is very likely that dead code elimination can get rid of
some of these duplicates. Thus the code is shortened to

    for (i = 0; i < L; i++) {
        for (k = 0; k < 1; k++) {
            for (j = 0; j < N; j++) {
                if (k == 0) aux[j] = 0;
                aux[j] += A[i][k] * B[k][j];
            }
        }
        for (k = 1; k < M; k += 7) {
            for (j = 0; j < N; j++) {
                aux[j] += A[i][k] * B[k][j];
            }
            for (j = 0; j < N; j++) {
                aux[j] += A[i][k+1] * B[k+1][j];
            }
            for (j = 0; j < N; j++) {
                aux[j] += A[i][k+2] * B[k+2][j];
            }
            for (j = 0; j < N; j++) {
                aux[j] += A[i][k+3] * B[k+3][j];
            }
            for (j = 0; j < N; j++) {
                aux[j] += A[i][k+4] * B[k+4][j];
            }
            for (j = 0; j < N; j++) {
                aux[j] += A[i][k+5] * B[k+5][j];
            }
            for (j = 0; j < N; j++) {
                aux[j] += A[i][k+6] * B[k+6][j];
                if (k+6 == M-1) R[i][j] = aux[j];
            }
        }
    }

Before we jam the inner loops we have to account for the fact that the first iteration of the k loop was
peeled off, which would produce its own configuration. Since we calculated the unroll-and-jam factor to
fit into one configuration, this side effect has to be prevented. Because it should be no problem to run
the k loop with variable step sizes, we fuse the k loops again, adjust the step size and guard the
statements. This yields

    for (i = 0; i < L; i++) {
        for (k = 0; k < M; k += k<1 ? 1 : 7) {
            for (j = 0; j < N; j++) {
                if (k == 0) aux[j] = 0;
                if (k == 0) aux[j] += A[i][k] * B[k][j];
            }
            for (j = 0; j < N; j++) {
                if (k > 0) aux[j] += A[i][k] * B[k][j];
            }
            for (j = 0; j < N; j++) {
                if (k > 0) aux[j] += A[i][k+1] * B[k+1][j];
            }
            for (j = 0; j < N; j++) {
                if (k > 0) aux[j] += A[i][k+2] * B[k+2][j];
            }
            for (j = 0; j < N; j++) {
                if (k > 0) aux[j] += A[i][k+3] * B[k+3][j];
            }
            for (j = 0; j < N; j++) {
                if (k > 0) aux[j] += A[i][k+4] * B[k+4][j];
            }
            for (j = 0; j < N; j++) {
                if (k > 0) aux[j] += A[i][k+5] * B[k+5][j];
            }
            for (j = 0; j < N; j++) {
                if (k > 0) aux[j] += A[i][k+6] * B[k+6][j];
                if (k+6 == M-1) R[i][j] = aux[j];
            }
        }
    }

Now we can jam the inner loops and finally obtain

    for (i = 0; i < L; i++) {
        for (k = 0; k < M; k += k<1 ? 1 : 7) {
            for (j = 0; j < N; j++) {
                if (k == 0) aux[j] = 0;
                if (k == 0) aux[j] += A[i][k] * B[k][j];
                if (k > 0) {
                    aux[j] += A[i][k]   * B[k][j];
                    aux[j] += A[i][k+1] * B[k+1][j];
                    aux[j] += A[i][k+2] * B[k+2][j];
                    aux[j] += A[i][k+3] * B[k+3][j];
                    aux[j] += A[i][k+4] * B[k+4][j];
                    aux[j] += A[i][k+5] * B[k+5][j];
                    aux[j] += A[i][k+6] * B[k+6][j];
                    if (k+6 == M-1) R[i][j] = aux[j];
                }
            }
        }
    }
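The jammed loop nest visits k = 0, 1 and 8 and still has to produce the full product. A small harness can confirm this. The sketch below is an assumption-laden illustration: the function names and test data are hypothetical, and the seven accumulation statements are written as a short loop over u purely for brevity, whereas the transformed code has them fully unrolled.

```c
#include <string.h>

enum { L = 10, M = 15, N = 20 };

/* Straightforward reference product. */
void matmul_naive(int R[L][N], int A[L][M], int B[M][N])
{
    for (int i = 0; i < L; i++)
        for (int j = 0; j < N; j++) {
            int aux = 0;
            for (int k = 0; k < M; k++)
                aux += A[i][k] * B[k][j];
            R[i][j] = aux;
        }
}

/* Peeled first iteration plus two groups of seven, with a variable step. */
void matmul_jammed(int R[L][N], int A[L][M], int B[M][N])
{
    int aux[N];
    for (int i = 0; i < L; i++)
        for (int k = 0; k < M; k += k < 1 ? 1 : 7)
            for (int j = 0; j < N; j++) {
                if (k == 0) aux[j] = 0;
                if (k == 0) aux[j] += A[i][k] * B[k][j];
                if (k > 0) {
                    for (int u = 0; u < 7; u++)      /* unrolled in the text */
                        aux[j] += A[i][k + u] * B[k + u][j];
                    if (k + 6 == M - 1) R[i][j] = aux[j];
                }
            }
}

/* Returns 1 if the jammed nest matches the reference product. */
int jammed_matches_naive(void)
{
    static int A[L][M], B[M][N], R0[L][N], R1[L][N];
    for (int i = 0; i < L; i++)
        for (int k = 0; k < M; k++)
            A[i][k] = (2 * i + 3 * k) % 9 - 4;
    for (int k = 0; k < M; k++)
        for (int j = 0; j < N; j++)
            B[k][j] = (7 * k + j) % 5 - 2;
    matmul_naive(R0, A, B);
    matmul_jammed(R1, A, B);
    return memcmp(R0, R1, sizeof R0) == 0;
}
```

Note that the scheme relies on M - 1 being a multiple of the unroll factor (15 - 1 = 2 * 7); for other values of M the peeling would have to change.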



5.3.3 XPP Code Generation 



The innermost loop can be synthesized in a configuration, which uses 14 IRAMs for the input data,
one IRAM to temporarily store aux and one IRAM for the output array R. Furthermore it is necessary to
pass the value of k to the XPP to direct the dataflow. This may be done by a streaming input. Figure
63 shows the dataflow graph of the synthesized configuration.

Figure 63 Dataflow graph of matrix multiplication after unroll and jam. The rightmost 3 branches are omitted.
Event connections are emphasized in red color.



The following code shows the pseudo code executed on the RISC processor.

    XPPPreload(config)
    for (i = 0; i < L; i++) {
        XPPPreload(0, &A[i][0], M)
        XPPPreload(1, &A[i][0], M)
        XPPPreload(2, &A[i][0], M)
        XPPPreload(3, &A[i][0], M)
        XPPPreload(4, &A[i][0], M)
        XPPPreload(5, &A[i][0], M)
        XPPPreload(6, &A[i][0], M)
        XPPPreloadClean(15, &R[i][0], M)
        for (k = 0; k < M; k += k<1 ? 1 : 7) {
            XPPPreload(7, &B[k][0], N)
            XPPPreload(8, &B[k+1][0], N)
            XPPPreload(9, &B[k+2][0], N)
            XPPPreload(10, &B[k+3][0], N)
            XPPPreload(11, &B[k+4][0], N)
            XPPPreload(12, &B[k+5][0], N)
            XPPPreload(13, &B[k+6][0], N)

            XPPExecute(config, IRAM(0), IRAM(1), IRAM(2), IRAM(3),
                       IRAM(4), IRAM(5), IRAM(6), IRAM(7),
                       IRAM(8), IRAM(9), IRAM(10), IRAM(11),
                       IRAM(12), IRAM(13), IRAM(15), k)
        }
    }



The table shows the simulated configuration. The complete multiplication needs about 3120 cycles
without the preloading and configuration. A typical RISC-DSP core with two units and hardware
loop support needs over 26000 cycles (when data is in zero-latency internal memory). Although
the time for preloads and cache misses is neglected here, the values promise improvements of 200-300
percent compared to a standalone RISC core.



Parameter                           Value
Vector length                       20
Reused data set size                20
I/O IRAMs                           14 I + 1 O + 1 internal
ALU                                 20
BREG                                26 (8 defined + 18 route)
FREG                                28 (4 defined + 24 route)
Data flow graph width               14
Data flow graph height              6 (without routing and balancing)
Configuration cycles (simulated)
  configuration                     2633
  preloads                          10*3*7*5 = 1050
                                    10*7*15 = 1050
  cycles                            ((k=0) 112 + (k=1) 100 + (k=7) 100) * 10 = 3120
  sum                               7853

5.4 Viterbi Encoder

5.4.1 Original Code

Source Code:

    /* C-language butterfly */
    #define BFLY(i) {\
        unsigned char metric,m0,m1,decision;\
        metric = ((Branchtab29_1[i] ^ sym1) +\
                  (Branchtab29_2[i] ^ sym2) + 1)/2;\
        m0 = vp->old_metrics[i] + metric;\
        m1 = vp->old_metrics[i+128] + (15 - metric);\
        decision = (m0-m1) >= 0;\
        vp->new_metrics[2*i] = decision ? m1 : m0;\
        vp->dp->w[i/16] |= decision << ((2*i)&31);\
        m0 -= (metric+metric-15);\
        m1 += (metric+metric-15);\
        decision = (m0-m1) >= 0;\
        vp->new_metrics[2*i+1] = decision ? m1 : m0;\
        vp->dp->w[i/16] |= decision << ((2*i+1)&31);\
    }

    int update_viterbi29(void *p, unsigned char sym1, unsigned char sym2) {
        int i;
        struct v29 *vp = p;
        unsigned char *tmp;
        int normalize = 0;

        for (i=0;i<8;i++)
            vp->dp->w[i] = 0;

        for (i=0;i<128;i++)
            BFLY(i);

        /* Renormalize metrics */
        if (vp->new_metrics[0] > 150) {
            int i;
            unsigned char minmetric = 255;

            for (i=0;i<64;i++)
                if (vp->new_metrics[i] < minmetric)
                    minmetric = vp->new_metrics[i];
            for (i=0;i<64;i++)
                vp->new_metrics[i] -= minmetric;
            normalize = minmetric;
        }

        vp->dp++;
        tmp = vp->old_metrics;
        vp->old_metrics = vp->new_metrics;
        vp->new_metrics = tmp;

        return normalize;
    }

5.4.2 Interprocedural Optimizations and Scalar Transformations

Since no inline-able function calls are present, no interprocedural code movement is done.

After expression simplification, strength reduction, SSA renaming, copy coalescing and idiom
recognition, the code looks like (statements reordered for convenience).
Note that idiom recognition will find the combination of min() and use of the comparison result for
decision and _decision. However the resulting computation cannot be expressed in C, so we describe it
as a comment:

    int update_viterbi29(void *p, unsigned char sym1, unsigned char sym2) {
        int i;
        struct v29 *vp = p;
        unsigned char *tmp;
        int normalize = 0;

        char *_vpdpw_ = vp->dp->w;
        for (i=0;i<8;i++)
            *_vpdpw_++ = 0;

        char *_bt29_1 = Branchtab29_1;
        char *_bt29_2 = Branchtab29_2;
        char *_vpom0 = vp->old_metrics;
        char *_vpom128 = vp->old_metrics+128;
        char *_vpnm = vp->new_metrics;
        char *_vpdpw = vp->dp->w;

        for (i=0;i<128;i++) {
            unsigned char metric,_tmp,m0,m1,_m0,_m1,decision,_decision;

            metric = ((*_bt29_1++ ^ sym1) +
                      (*_bt29_2++ ^ sym2) + 1)/2;
            _tmp = (metric+metric-15);
            m0 = *_vpom0++ + metric;
            m1 = *_vpom128++ + (15 - metric);
            _m0 = m0 - _tmp;
            _m1 = m1 + _tmp;
            // decision = m0 >= m1;
            // _decision = _m0 >= _m1;
            *_vpnm++ = min(m0,m1);     // = decision ? m1 : m0
            *_vpnm++ = min(_m0,_m1);   // = _decision ? _m1 : _m0
            _vpdpw[i >> 4] |= ( m0 >=  m1) /* decision */  << ((2*i) & 31)
                            | (_m0 >= _m1) /* _decision */ << ((2*i+1) & 31);
        }

        /* Renormalize metrics */
        if (vp->new_metrics[0] > 150) {
            int i;
            unsigned char minmetric = 255;

            char *_vpnm = vp->new_metrics;
            for (i=0;i<64;i++)
                minmetric = min(minmetric, *_vpnm++);

            char *_vpnm = vp->new_metrics;
            for (i=0;i<64;i++)
                *_vpnm++ -= minmetric;
            normalize = minmetric;
        }

        vp->dp++;
        tmp = vp->old_metrics;
        vp->old_metrics = vp->new_metrics;
        vp->new_metrics = tmp;

        return normalize;
    }
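The idiom that is recognized here, a compare followed by a select of the smaller operand, can be checked exhaustively for byte-sized metrics. The sketch below is an illustration with hypothetical function names; it uses plain `int` values in the range 0..255 to stand for the promoted `unsigned char` operands.

```c
static int min_int(int a, int b)
{
    return a < b ? a : b;
}

/* Counts the (m0, m1) pairs for which the select form of the butterfly
   disagrees with the min() idiom or with the simplified comparison. */
int count_idiom_mismatches(void)
{
    int mismatches = 0;
    for (int m0 = 0; m0 < 256; m0++)
        for (int m1 = 0; m1 < 256; m1++) {
            int decision = (m0 - m1) >= 0;          /* original form */
            int selected = decision ? m1 : m0;
            if (selected != min_int(m0, m1))
                mismatches++;
            if (decision != (m0 >= m1))             /* simplified form */
                mismatches++;
        }
    return mismatches;
}
```

A result of zero means that the value written to new_metrics is always the minimum, and that the decision bit is exactly the outcome of the comparison the minimum already needs.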



5.4.3 Initialization 



The first loop (setting vp->dp->w[0..7] to zero) is most efficiently executed on the RISC. 



5.4.4 Butterfly Loop

The second loop (with the BFLY() macro expanded) is of interest for the XPP compiler and needs
further examination:
    char *iram0 = Branchtab29_1;        // XPPPreload(0, Branchtab29_1, 128/4);
    char *iram2 = Branchtab29_2;        // XPPPreload(2, Branchtab29_2, 128/4);
    char *iram4 = vp->old_metrics;      // XPPPreload(4, vp->old_metrics, 128/4);
    char *iram5 = vp->old_metrics+128;  // XPPPreload(5, vp->old_metrics+128, 128/4);
    short *iram6 = vp->new_metrics;     // XPPPreload(6, vp->new_metrics, 128/2);
    unsigned long *iram7 = vp->dp->w;   // XPPPreload(7, vp->dp->w, 8);
    // sym1 & sym2 are in IRAM 1 & 3

    for (i=0;i<128;i++) {
        unsigned char metric,_tmp,m0,m1,_m0,_m1;

        metric = ((*iram0++ ^ sym1) +
                  (*iram1++ ^ sym2) + 1)/2;
        _tmp = (metric << 1) - 15;
        m0 = *iram2++ + metric;
        m1 = *iram3++ + (15 - metric);
        _m0 = m0 - _tmp;
        _m1 = m1 + _tmp;
        // assuming big endian; little endian has the shift on the latter min()
        *iram6++ = (min(m0,m1) << 8) | min(_m0,_m1);

Parameter                   Value
Vector length               128
Reused data set size
I/O IRAMs
ALU                         10
FREG                        few
Data flow graph width       4
Data flow graph height      11
Configuration cycles        11+128



We immediately see some problems: IRAM7 is fully busy reading and rewriting the same address
sixteen times. Loop tiling to a tile size of sixteen gives the redundant load store elimination a chance
to read the value once and accumulate the bits in the temporary, writing the value to the IRAM at the
end of this inner loop. Loop fusion with the initialization loop then allows propagation of the zero
values set in the first loop to the reads of vp->dp->w[i] (IRAM7), eliminating the first loop altogether.
Loop tiling with a tile size of 16 also eliminates the & 31 expressions for the shift values: since the
new inner loop only runs from 0 to 16, the value range analysis now finds that the & 31 expression is
not limiting the value range any further.

All remaining input IRAMs are character (8 bit) based. So we need split networks to split the 32-bit
stream into four 8-bit streams which are then merged. This adds 3 shifts, 3 ands and 3 merges for
every character IRAM. The merges could be eliminated when unrolling the loop body. However,
unrolling is limited to unrolling twice due to ALU availability as well as due to the fact that IRAM6 is
already 16 bit based: unrolling once requires a shift by 16 and an or to write 32 bits in every cycle;
unrolling further cannot increase pipeline throughput any more. So the body is only unrolled once,
eliminating one layer of merges. This yields two separate pipelines, that each handle two eight bit
slices of the 32-bit value from the IRAM, serialized by merges.
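The effect of tiling on the decision words can be shown on its own, independent of the metric computation. In the sketch below the decision bits are given as two precomputed arrays; these arrays, the function names and the use of `uint32_t` for one word of vp->dp->w are assumptions made for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Original form: zero the words first, then read-modify-write the same
   word sixteen times, masking the shift amount with & 31. */
void pack_original(uint32_t w[8], const unsigned char d0[128],
                   const unsigned char d1[128])
{
    memset(w, 0, 8 * sizeof w[0]);            /* initialization loop */
    for (int i = 0; i < 128; i++) {
        w[i / 16] |= (uint32_t)d0[i] << ((2 * i) & 31);
        w[i / 16] |= (uint32_t)d1[i] << ((2 * i + 1) & 31);
    }
}

/* Tiled form: accumulate one word in a temporary and store it once.
   The zero initialization is folded into the temporary, and the shift
   amount i2 stays below 32, so no mask is needed. */
void pack_tiled(uint32_t w[8], const unsigned char d0[128],
                const unsigned char d1[128])
{
    for (int t = 0; t < 8; t++) {
        uint32_t acc = 0;
        for (int i2 = 0; i2 < 32; i2 += 2) {
            int i = 16 * t + i2 / 2;          /* original loop index */
            acc |= (uint32_t)d0[i] << i2;
            acc |= (uint32_t)d1[i] << (i2 + 1);
        }
        w[t] = acc;
    }
}

/* Returns 1 if both forms produce the same eight decision words. */
int tiling_preserves_bits(void)
{
    unsigned char d0[128], d1[128];
    uint32_t a[8], b[8];
    for (int i = 0; i < 128; i++) {
        d0[i] = (unsigned char)(((i * 5 + 1) % 3) == 0);
        d1[i] = (unsigned char)((i % 7) < 3);
    }
    pack_original(a, d0, d1);
    pack_tiled(b, d0, d1);
    return memcmp(a, b, sizeof a) == 0;
}
```

The tiled form performs eight stores instead of 256 read-modify-write accesses, which is the relief for IRAM7 that the paragraph above describes.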



The modified code now looks like (unrolling and splitting omitted for simplicity):

    char *iram0 = Branchtab29_1;        // XPPPreload(0, Branchtab29_1, 128/4);
    char *iram2 = Branchtab29_2;        // XPPPreload(2, Branchtab29_2, 128/4);
    char *iram4 = vp->old_metrics;      // XPPPreload(4, vp->old_metrics, 128/4);
    char *iram5 = vp->old_metrics+128;  // XPPPreload(5, vp->old_metrics+128, 128/4);
    short *iram6 = vp->new_metrics;     // XPPPreload(6, vp->new_metrics, 128/2);
    unsigned long *iram7 = vp->dp->w;   // XPPPreload(7, vp->dp->w, 8);
    // sym1 & sym2 are in IRAM 1 & 3

    for (_i=0;_i<8;_i++) {
        rlse = 0;
        for (i2=0;i2<32;i2+=2) {
            unsigned char metric,_tmp,m0,m1,_m0,_m1;

            metric = ((*iram0++ ^ sym1) +
                      (*iram1++ ^ sym2) + 1)/2;
            _tmp = (metric << 1) - 15;
            m0 = *iram2++ + metric;
            m1 = *iram3++ + (15 - metric);
            _m0 = m0 - _tmp;
            _m1 = m1 + _tmp;
            *iram6++ = (min(m0,m1) << 8) | min(_m0,_m1);
            rlse = rlse | ( m0 >=  m1) << i2
                        | (_m0 >= _m1) << (i2+1);
        }
        *iram7++ = rlse;
    }



Parameter                   Value
Vector length               128
Reused data set size
I/O IRAMs                   6 I + 2 O
ALU                         2*24+4*3(split)+2(join) = 62
BREG                        few
FREG                        few
Data flow graph width       4
Data flow graph height      11+3(split)
Configuration cycles        14+64



5.4.5 Re-Normalization: 

The normalization consists of a loop scanning the input for the minimum and a second loop that
subtracts the minimum from all elements. There is a data dependency between all iterations of the first
loop and all iterations of the second loop. Therefore the two loops cannot be merged or pipelined.
They will be handled individually.

Minimum Search 

The third loop is a minimum search on a byte array. 

    char *iram0 = vp->new_metrics;  // XPPPreload(0, vp->new_metrics, 64/4);
    for (i=0;i<64;i++)
        minmetric = min(minmetric, *iram0++);

Parameter                   Value
Vector length               64
Reused data set size        -
I/O IRAMs                   1+1
ALU                         1
BREG                        0
FREG                        0
Data flow graph width       1
Data flow graph height      1
Configuration cycles        64



Reduction recognition eliminates the dependence for minmetric, enabling a four-times unroll to utilize
the IRAM width of 32 bits. A split network has to be added to separate the 8 bit streams using 3
SHIFT and 3 AND operations. Tree balancing re-distributes the min() operations to minimize the tree
height.

    char *iram0 = vp->new_metrics;  // XPPPreload(0, vp->new_metrics, 16);
    for (i=0;i<16;i++)
        minmetric = min(minmetric, min( min(*iram0++, *iram0++),
                                        min(*iram0++, *iram0++) ));
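The four-times unrolled, tree-balanced minimum search can be written as ordinary C for comparison with the sequential scan. This sketch uses explicit array indices instead of repeated `*iram0++` operands, since the order of several increments within one expression is not defined in standard C; the function names are hypothetical.

```c
static unsigned char min_u8(unsigned char a, unsigned char b)
{
    return a < b ? a : b;
}

/* Sequential scan: one min() per element, dependence height n. */
unsigned char min_sequential(const unsigned char *v, int n)
{
    unsigned char m = 255;
    for (int i = 0; i < n; i++)
        m = min_u8(m, v[i]);
    return m;
}

/* Four elements per iteration, combined by a balanced tree.
   Assumes n is a multiple of 4, as is the case for 64 metrics. */
unsigned char min_unrolled4(const unsigned char *v, int n)
{
    unsigned char m = 255;
    for (int i = 0; i < n; i += 4) {
        unsigned char lo = min_u8(v[i], v[i + 1]);
        unsigned char hi = min_u8(v[i + 2], v[i + 3]);
        m = min_u8(m, min_u8(lo, hi));
    }
    return m;
}
```

Only the last `min_u8` of each iteration depends on the previous iteration, so the loop-carried chain shrinks from 64 steps to 16.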



Parameter                   Value
Vector length               16
Reused data set size
I/O IRAMs                   1 I + 1 O
ALU                         4*min
BREG                        3*shln+3*shrn
FREG                        0
Data flow graph width       4
Data flow graph height      5
Configuration cycles        5+16



Reduction recognition again eliminates the loop-carried dependence for minmetric, enabling loop
tiling and then unroll and jam to increase parallelism; the maximum for the tiling size is 16 IRAMs / 2
IRAMs = 8. Constant propagation and tree rebalancing reduce the dependence height of the final
merging expression:

    char *iram0 = vp->new_metrics;     // XPPPreload(0, vp->new_metrics, 2);
    char *iram1 = vp->new_metrics+8;   // XPPPreload(1, vp->new_metrics+8, 2);
    char *iram2 = vp->new_metrics+16;  // XPPPreload(2, vp->new_metrics+16, 2);
    char *iram3 = vp->new_metrics+24;  // XPPPreload(3, vp->new_metrics+24, 2);
    char *iram4 = vp->new_metrics+32;  // XPPPreload(4, vp->new_metrics+32, 2);
    char *iram5 = vp->new_metrics+40;  // XPPPreload(5, vp->new_metrics+40, 2);
    char *iram6 = vp->new_metrics+48;  // XPPPreload(6, vp->new_metrics+48, 2);
    char *iram7 = vp->new_metrics+56;  // XPPPreload(7, vp->new_metrics+56, 2);

    for (i=0;i<2;i++) {
        minmetric_0 = min(minmetric_0, min( min(*iram0++, *iram0++),
                                            min(*iram0++, *iram0++) ));
        minmetric_1 = min(minmetric_1, min( min(*iram1++, *iram1++),
                                            min(*iram1++, *iram1++) ));
        minmetric_2 = min(minmetric_2, min( min(*iram2++, *iram2++),
                                            min(*iram2++, *iram2++) ));
        minmetric_3 = min(minmetric_3, min( min(*iram3++, *iram3++),
                                            min(*iram3++, *iram3++) ));
        minmetric_4 = min(minmetric_4, min( min(*iram4++, *iram4++),
                                            min(*iram4++, *iram4++) ));
        minmetric_5 = min(minmetric_5, min( min(*iram5++, *iram5++),
                                            min(*iram5++, *iram5++) ));
        minmetric_6 = min(minmetric_6, min( min(*iram6++, *iram6++),
                                            min(*iram6++, *iram6++) ));
        minmetric_7 = min(minmetric_7, min( min(*iram7++, *iram7++),
                                            min(*iram7++, *iram7++) ));
    }

    minmetric = min( min( min(minmetric_0, minmetric_1),
                          min(minmetric_2, minmetric_3) ),
                     min( min(minmetric_4, minmetric_5),
                          min(minmetric_6, minmetric_7) ));



Parameter                   Value
Vector length               2
Reused data set size
I/O IRAMs                   8 I + 1 O
ALU                         8*4*min = 32
BREG                        8*(3*shln+3*shrn) = 48
FREG                        0
Data flow graph width       8*4 = 32
Data flow graph height      5
Configuration cycles        8+2



Re-Normalization 

The fourth loop subtracts the minimum of the third loop from each element in the array. The
read-modify-write operation has to be broken up into two IRAMs. Otherwise the IRAM ports will limit
throughput.

    char *iram0 = vp->new_metrics;  // XPPPreload(0, vp->new_metrics, 64/4)
    char *iram1 = vp->new_metrics;  // XPPPreloadClean(1, vp->new_metrics, 64/4)
    for (i=0;i<64;i++)
        *iram1++ = *iram0++ - minmetric;

Parameter                   Value
Vector length               64
Reused data set size
I/O IRAMs                   2 I + 1 O
ALU                         1
BREG                        0
FREG                        0
Data flow graph width       1
Data flow graph height      1
Configuration cycles        64



There are no loop-carried dependencies. Since the data size is bytes, the inner loop can be unrolled
four times without exceeding the IRAM bandwidth requirements. Networks splitting the 32-bit stream
into 4 8-bit streams and re-joining the individual results to a common 32-bit result stream are inserted.

    char *iram0 = vp->new_metrics;  // XPPPreload(0, vp->new_metrics, 16)
    char *iram1 = vp->new_metrics;  // XPPPreloadClean(1, vp->new_metrics, 16)
    for (i=0;i<16;i++) {
        *iram1++ = *iram0++ - minmetric;
        *iram1++ = *iram0++ - minmetric;
        *iram1++ = *iram0++ - minmetric;
        *iram1++ = *iram0++ - minmetric;
    }
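What the split and join networks do to one 32-bit IRAM word can be sketched in C as follows. The function name and the little-endian lane numbering are assumptions for illustration; on the array the four lanes would be processed by parallel ALUs rather than by a loop.

```c
#include <stdint.h>

/* Subtract minmetric from each of the four bytes packed in one word:
   split with shifts and ANDs, subtract per lane, join with shifts and ORs. */
uint32_t sub_packed4(uint32_t word, uint8_t minmetric)
{
    uint32_t out = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint8_t byte = (uint8_t)((word >> (8 * lane)) & 0xFFu);  /* split */
        uint8_t diff = (uint8_t)(byte - minmetric);              /* sub   */
        out |= (uint32_t)diff << (8 * lane);                     /* join  */
    }
    return out;
}
```

For example, `sub_packed4(0x0A141E28u, 10)` subtracts 10 from the bytes 40, 30, 20 and 10 and returns `0x000A141E`. Keeping the lanes separate guarantees that no borrow crosses a byte boundary, even in the wrap-around case that the renormalization itself never triggers because minmetric is the smallest element.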



Parameter                   Value
Vector length               16
Reused data set size
I/O IRAMs                   2 I + 1 O
ALU                         4*4(sub) = 16
BREG                        6*shln+6*shrn = 12
FREG                        0
Data flow graph width       4
Data flow graph height      5
Configuration cycles        2(split)+4*1(sub)+2(join) = 8



Unroll and jam can be applied after loop tiling, in analogy to the third loop, but loop tiling is now
limited by the BREGs used by the split and join networks. The computed tiling size (unroll factor) is 64
BREGs/12 BREGs = 5, which is replaced by 4, since the same throughput is achieved with less
overhead.

    char *iram0 = vp->new_metrics;     // XPPPreload(0, vp->new_metrics, 4)
    char *iram1 = vp->new_metrics;     // XPPPreloadClean(1, vp->new_metrics, 4)
    char *iram2 = vp->new_metrics+16;  // XPPPreload(2, vp->new_metrics+16, 4)
    char *iram3 = vp->new_metrics+16;  // XPPPreloadClean(3, vp->new_metrics+16, 4)
    char *iram4 = vp->new_metrics+32;  // XPPPreload(4, vp->new_metrics+32, 4)
    char *iram5 = vp->new_metrics+32;  // XPPPreloadClean(5, vp->new_metrics+32, 4)
    char *iram6 = vp->new_metrics+48;  // XPPPreload(6, vp->new_metrics+48, 4)
    char *iram7 = vp->new_metrics+48;  // XPPPreloadClean(7, vp->new_metrics+48, 4)

    for (i=0;i<4;i++) {
        *iram1++ = *iram0++ - minmetric;  // first pipeline
        *iram1++ = *iram0++ - minmetric;
        *iram1++ = *iram0++ - minmetric;
        *iram1++ = *iram0++ - minmetric;
        *iram3++ = *iram2++ - minmetric;  // second pipeline
        *iram3++ = *iram2++ - minmetric;
        *iram3++ = *iram2++ - minmetric;
        *iram3++ = *iram2++ - minmetric;
        *iram5++ = *iram4++ - minmetric;  // third pipeline
        *iram5++ = *iram4++ - minmetric;
        *iram5++ = *iram4++ - minmetric;
        *iram5++ = *iram4++ - minmetric;
        *iram7++ = *iram6++ - minmetric;  // fourth pipeline
        *iram7++ = *iram6++ - minmetric;
        *iram7++ = *iram6++ - minmetric;
        *iram7++ = *iram6++ - minmetric;
    }



Parameter                   Value
Vector length               4
Reused data set size        -
I/O IRAMs                   5 I + 4 O
ALU                         4*(6(split) + 4(sub) + 6(join)) = 64
BREG                        4*(6*shbi + 6*shm) = 48
FREG                        0
Data flow graph width       16
Data flow graph height      1
Configuration cycles        2(split) + 4*1(sub) + 2(join) = 8



5.4.6 Final Code 

Finally we arrive at the following code:

int update_viterbi29(void *p, unsigned char sym1, unsigned char sym2) {
    int i;
    struct v29 *vp = p;
    unsigned char *tmp;
    int normalize = 0;

    // initialization loop eliminated
    // for (i=0;i<8;i++)
    //     vp->dp->w[i] = 0;

    // Configuration for butterfly loop
    char *iram0 = Branchtab29_1;          // XPPPreload(0, Branchtab29_1, 128/4);
    char *iram2 = Branchtab29_2;          // XPPPreload(2, Branchtab29_2, 128/4);
    char *iram4 = vp->old_metrics;        // XPPPreload(4, vp->old_metrics, 128/4);
    char *iram5 = vp->old_metrics+128;    // XPPPreload(5, vp->old_metrics+128, 128/4);
    short *iram6 = vp->new_metrics;       // XPPPreload(6, vp->new_metrics, 128/2);
    unsigned long *iram7 = vp->dp->w;     // XPPPreload(7, vp->dp->w, 8);
    // sym1 & sym2 are in IRAM 1 & 3

    for (_i=0; _i<8; _i++) {
        rise = 0;

        for (i2=0; i2<32; i2+=2) {  // unrolled once
            unsigned char metric, _tmp, m0, m1, _m0, _m1;

            metric = ((*iram0++ ^ sym1) +
                      (*iram1++ ^ sym2) + 1)/2;
            _tmp = (metric << 1) - 15;
            m0 = *iram2++ + metric;
            m1 = *iram3++ + (15 - metric);
            _m0 = m0 - _tmp;
            _m1 = m1 + _tmp;

            *iram6++ = (min(m0,m1) << 8) | min(_m0,_m1);
            rise = rise | (m0 >= m1) << i2
                        | (_m0 >= _m1) << (i2+1);
        }

        *iram7++ = rise;
    }

    /* Renormalize metrics */
    if (vp->new_metrics[0] > 150) {
        int i;

        // Configuration for loop 3
        char *iram0 = vp->new_metrics;     // XPPPreload(0, vp->new_metrics, 8);
        char *iram1 = vp->new_metrics+8;   // XPPPreload(1, vp->new_metrics+8, 8);
        char *iram2 = vp->new_metrics+16;  // XPPPreload(2, vp->new_metrics+16, 8)
        char *iram3 = vp->new_metrics+24;  // XPPPreload(3, vp->new_metrics+24, 8)
        char *iram4 = vp->new_metrics+32;  // XPPPreload(4, vp->new_metrics+32, 8)
        char *iram5 = vp->new_metrics+40;  // XPPPreload(5, vp->new_metrics+40, 8)
        char *iram6 = vp->new_metrics+48;  // XPPPreload(6, vp->new_metrics+48, 8)
        char *iram7 = vp->new_metrics+56;  // XPPPreload(7, vp->new_metrics+56, 8)

        for (i=0; i<2; i++) {
            minmetric0 = min(minmetric0, min( min(*iram0++, *iram0++),
                                              min(*iram0++, *iram0++) ));
            minmetric1 = min(minmetric1, min( min(*iram1++, *iram1++),
                                              min(*iram1++, *iram1++) ));
            minmetric2 = min(minmetric2, min( min(*iram2++, *iram2++),
                                              min(*iram2++, *iram2++) ));
            minmetric3 = min(minmetric3, min( min(*iram3++, *iram3++),
                                              min(*iram3++, *iram3++) ));
            minmetric4 = min(minmetric4, min( min(*iram4++, *iram4++),
                                              min(*iram4++, *iram4++) ));
            minmetric5 = min(minmetric5, min( min(*iram5++, *iram5++),
                                              min(*iram5++, *iram5++) ));
            minmetric6 = min(minmetric6, min( min(*iram6++, *iram6++),
                                              min(*iram6++, *iram6++) ));
            minmetric7 = min(minmetric7, min( min(*iram7++, *iram7++),
                                              min(*iram7++, *iram7++) ));
        }

        minmetric = min( min( min(minmetric_0, minmetric_1),
                              min(minmetric_2, minmetric_3) ),
                         min( min(minmetric_4, minmetric_5),
                              min(minmetric_6, minmetric_7) ) );
        // minmetric is written to the output IRAM

        // Configuration for loop 4, minmetric is in an input IRAM
        char *iram0 = vp->new_metrics;     // XPPPreload(0, vp->new_metrics, 4)
        char *iram1 = vp->new_metrics;     // XPPPreloadClean(1, vp->new_metrics, 4)
        char *iram2 = vp->new_metrics+16;  // XPPPreload(2, vp->new_metrics+16, 4)
        char *iram3 = vp->new_metrics+16;  // XPPPreloadClean(3, vp->new_metrics+16, 4)
        char *iram4 = vp->new_metrics+32;  // XPPPreload(4, vp->new_metrics+32, 4)
        char *iram5 = vp->new_metrics+32;  // XPPPreloadClean(5, vp->new_metrics+32, 4)
        char *iram6 = vp->new_metrics+48;  // XPPPreload(6, vp->new_metrics+48, 4)
        char *iram7 = vp->new_metrics+48;  // XPPPreloadClean(7, vp->new_metrics+48, 4)

        for (i=0; i<4; i++) {
            *iram1++ = *iram0++ - minmetric;   // first pipeline
            *iram1++ = *iram0++ - minmetric;
            *iram1++ = *iram0++ - minmetric;
            *iram1++ = *iram0++ - minmetric;
            *iram3++ = *iram2++ - minmetric;   // second pipeline
            *iram3++ = *iram2++ - minmetric;
            *iram3++ = *iram2++ - minmetric;
            *iram3++ = *iram2++ - minmetric;
            *iram5++ = *iram4++ - minmetric;   // third pipeline
            *iram5++ = *iram4++ - minmetric;
            *iram5++ = *iram4++ - minmetric;
            *iram5++ = *iram4++ - minmetric;
            *iram7++ = *iram6++ - minmetric;   // fourth pipeline
            *iram7++ = *iram6++ - minmetric;
            *iram7++ = *iram6++ - minmetric;
            *iram7++ = *iram6++ - minmetric;
        }

        normalize = minmetric;
    }

    vp->dp++;

    tmp = vp->old_metrics;
    vp->old_metrics = vp->new_metrics;
    vp->new_metrics = tmp;

    return normalize;
}



Performance Considerations 

In this example we do not have a high data locality. Every input data item is read exactly once. Only in
the case of re-normalization, the new_metric array is re-read and re-written. To fully utilize the PAE
array loop tiling was used - in conjunction with reduction recognition to break dependencies using
algebraic identities. In some cases (minimum search) this leads to extremely short vector lengths. This
does not hurt as it still reduces the running time of the configuration, and the transfer time from the
top of the memory hierarchy to the IRAMs stays the same. The vector length could be increased if the
outer loop that calls the function was known - the additional data could be used to increase the fill
grade of the IRAMs by unrolling the outer loop, keeping the vector length longer. This would further
increase configuration performance by reducing overall pipeline setup times.

Performance of XPP for this example is compared to a hypothetical superscalar RISC-architecture. We
assume an average issue width of two which means that the RISC on average executes two operations
in parallel. The estimate is achieved by counting instructions for the source code in 5.4.2.

Operation       Cycles
LD/ST           2
LDI             1
MOVE            1
BITOP           1
ADD/SUB         1
MULT            2
CJMP            3

[Table: operation counts per phase for the columns Bfly Setup, Butterfly, Min Setup, Min Search,
Norm Setup and Normalize, with the rows Cycles, Count, Issue Width and Total Cycles. Legible
entries:]

Phase           Count    Total Cycles
Butterfly       128      4480
Min Search      64       352
Normalize       64       320

Est. RISC cycles: 5168 RISC cycles

5.5 MPEG2 encoder/decoder



5.5.1 Quantization / inverse Quantization (quant.c) 



The quantization file contains routines for quantization and inverse quantization of 8x8 macro blocks.
These functions differ for intra and non-intra blocks and furthermore the encoder distinguishes
between MPEG1 and MPEG2 inverse quantization.

This gives a total of 6 functions, which are all candidates for function inlining, since they do not use
the XPP capacity by far.

Since all functions have the same layout (some checks, one main loop running over the macro block
quantizing with a quantization matrix), we concentrate on "iquant_intra", the inverse quantization of
intra-blocks, since it contains all elements found in the other procedures (the non_intra quantization
loop bodies are more complicated, but add no compiler complexity). In the source code the mpeg1 part
is already inlined, which is straightforward since the function is statically defined and contains no
function calls itself. Therefore the compiler inlines it and dead function elimination removes the whole
definition.



Original Code

void iquant_intra(src, dst, dc_prec, quant_mat, mquant)
short *src, *dst;
int dc_prec;
unsigned char *quant_mat;
int mquant;
{
    int i, val, sum;

    if (mpeg1) {
        dst[0] = src[0] << (3-dc_prec);
        for (i=1; i<64; i++)
        {
            val = (int)(src[i]*quant_mat[i]*mquant)/16;

            /* mismatch control */
            if ((val&1)==0 && val!=0)
                val+= (val>0) ? -1 : 1;

            /* saturation */
            dst[i] = (val>2047) ? 2047 : ((val<-2048) ? -2048 : val);
        }
    }
    else
    {
        sum = dst[0] = src[0] << (3-dc_prec);
        for (i=1; i<64; i++)
        {
            val = (int)(src[i]*quant_mat[i]*mquant)/16;
            sum+= dst[i] = (val>2047) ? 2047 : ((val<-2048) ? -2048 : val);
        }

        /* mismatch control */
        if ((sum&1)==0)
            dst[63]^= 1;
    }
}

Interprocedural Optimizations

Analysing the loop bodies shows that they easily fit to the XPP and do not use the maximum of
resources by far. The function is called three times from module putseq.c. With inter-module function
inlining the code for the function call disappears and is replaced with the function. Therefore it reads

for (k=0; k<mb_height*mb_width; k++) {
    if (mbinfo[k].mb_type & MB_INTRA) {
        for (j=0; j<block_count; j++)
            if (mpeg1) {
                blocks[k*block_count+j][0] = blocks[k*block_count+j][0] <<
                                             (3-dc_prec);
                for (i=1; i<64; i++) {
                    val = (int)(blocks[k*block_count+j][i]*intra_q[i]*mquant)/16;

                }
            } else {
                sum = blocks[k*block_count+j][0] = blocks[k*block_count+j][0] <<
                                                   (3-dc_prec);
                for (i=1; i<64; i++) {
                    val = (int)(blocks[k*block_count+j][i]*intra_q[i]*mquant)/16;

                }

            }
    } else {

    }
}

Basic transformations 

Since the global mpeg1 does not change within the loop, unswitching moves the control statement
outside the j loop and produces two loop nests.

for (k=0; k<mb_height*mb_width; k++) {
    if (mbinfo[k].mb_type & MB_INTRA)
        if (mpeg1)
            for (j=0; j<block_count; j++) {
                blocks[k*block_count+j][0] = blocks[k*block_count+j][0] <<
                                             (3-dc_prec);
                for (i=1; i<64; i++) {
                    val = (int)(blocks[k*block_count+j][i]*intra_q[i]*mquant)/16;

                }
            }
        else
            for (j=0; j<block_count; j++) {
                sum = blocks[k*block_count+j][0] = blocks[k*block_count+j][0] <<
                                                   (3-dc_prec);
                for (i=1; i<64; i++) {
                    val = (int)(blocks[k*block_count+j][i]*intra_q[i]*mquant)/16;

                }
            }

Furthermore the following transformations are done:

■ A peephole optimization reduces the divide by 16 to a right shift by 4. This is essential since we do
not consider loop bodies containing division for the XPP.

■ Idiom recognition reduces the statement after the "saturation" comment to
dst[i] = min(max(val, -2048), 2047)
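
The two rewrites can be checked on a small scale. The following is a minimal sketch, not taken from the compiler described here; the helper names (min_i, max_i, saturate_ternary, saturate_minmax, div16, shr4) are hypothetical. It shows that the ternary saturation and the min/max idiom agree over a range of values. It also places the division next to the shift: in C the two agree for non-negative values, while for negative values the division truncates toward zero and the result of the shift is implementation-defined, a case the text above does not discuss.

```c
/* Hypothetical helpers; the text only names min() and max(). */
static int min_i(int a, int b) { return a < b ? a : b; }
static int max_i(int a, int b) { return a > b ? a : b; }

/* Saturation as written in the original loop body. */
int saturate_ternary(int val) {
    return (val > 2047) ? 2047 : ((val < -2048) ? -2048 : val);
}

/* Saturation after idiom recognition. */
int saturate_minmax(int val) {
    return min_i(max_i(val, -2048), 2047);
}

/* Returns 1 if both forms agree for every value in [lo, hi]. */
int saturation_forms_agree(int lo, int hi) {
    for (int v = lo; v <= hi; v++)
        if (saturate_ternary(v) != saturate_minmax(v))
            return 0;
    return 1;
}

/* Division by 16 as in the source code (truncates toward zero). */
int div16(int val) { return val / 16; }

/* Shift form produced by the peephole step; equal to div16 for val >= 0. */
int shr4(int val) { return val >> 4; }
```

Calling saturation_forms_agree(-5000, 5000) covers both clipping boundaries as well as the pass-through range.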

Increasing parallelism 

Now we want to increase parallelism. The j-i loop nest is a candidate for unroll-and-jam when the
interprocedural value range analysis finds out that block_count can only take the values 6, 8 or 12.
Therefore it has a value range [6,12] with the additional attribute of being divisible by 2. Thus an
unroll and jam with the factor 2 is applicable (the resource constraints would choose a bigger value).
Since no loop carried dependencies exist, this transformation is safe.

Note that the source code contains a manually peeled first iteration. This peeling has been done
because the value calculated for the first block value is completely different from the other iterations
and the control statement in the loop would cause a major performance decrease on traditional
processors. Although this does not prevent unroll-and-jam (because there are no dependencies between
the peeled-off first iteration and the rest of the loop), the transformation must be prepared to handle
such cases.

After unroll and jam the source code looks like (only one of the nests shown and the peeled first
iterations moved in front)

for (j=0; j<block_count; j+=2) {
    blocks[k*count+j][0] = blocks[k*count+j][0] << (3-dc_prec);
    blocks[k*count+j+1][0] = blocks[k*count+j+1][0] << (3-dc_prec);
    for (i=1; i<64; i++) {
        val = (int)(blocks[k*count+j][i]*intra_q[i]*mbinfo[k].mquant)>>4;

        /* mismatch control */
        if ((val&1)==0 && val!=0)
            val+= (val>0) ? -1 : 1;

        /* saturation */
        blocks[k*count+j][i] = min(max(val, -2048), 2047);

        val = (int)(blocks[k*count+j+1][i]*intra_q[i]*mbinfo[k].mquant)>>4;

        /* mismatch control */
        if ((val&1)==0 && val!=0)
            val+= (val>0) ? -1 : 1;

        /* saturation */
        blocks[k*count+j+1][i] = min(max(val, -2048), 2047);
    }
}

Further parallelism can be obtained by index set splitting. Normally used to break dependence cycles
in the DDG, it can here be used to split the i-loop in two and let two sub-configurations^ work on
distinct blocks of data. Thus the i loop is split into 2 or more loops which work on different subsets of
the data at the same time.
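
As an illustration of index set splitting, the sketch below splits a loop over i = 1..63 at i = 32 into two loops on disjoint index sets. It is a minimal sketch under stated assumptions: the kernel scale() is a hypothetical stand-in for the quantization loop body, and the split point 32 is chosen only for the example. Because no iteration depends on another, both halves could run as independent networks.

```c
/* Hypothetical element-wise kernel standing in for the loop body. */
static short scale(short v, int q) { return (short)((v * q) / 16); }

/* Original loop over the full index set 1..63. */
void quant_single(short *blk, const unsigned char *q) {
    for (int i = 1; i < 64; i++)
        blk[i] = scale(blk[i], q[i]);
}

/* After index set splitting: two loops over disjoint subsets of i. */
void quant_split(short *blk, const unsigned char *q) {
    for (int i = 1; i < 32; i++)      /* first sub-configuration  */
        blk[i] = scale(blk[i], q[i]);
    for (int i = 32; i < 64; i++)     /* second sub-configuration */
        blk[i] = scale(blk[i], q[i]);
}

/* Returns 1 if both variants produce the same block. */
int split_matches_single(void) {
    short a[64], b[64];
    unsigned char q[64];
    for (int i = 0; i < 64; i++) {
        a[i] = b[i] = (short)(i * 37 - 1000);
        q[i] = (unsigned char)(i + 1);
    }
    quant_single(a, q);
    quant_split(b, q);
    for (int i = 0; i < 64; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```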



^ sub-configuration is chosen as a working title for configurations which contain independent networks
that do not interfere.

Handling the data types

In contrast to the FIR-Filter, edge detector and matrix multiplication benchmarks, which all use data
types fitting perfectly to the XPP^, the MPEG2 codec uses all data types commonly used on a
processor for desktop applications. Written for the Intel x86 and comparable architectures, we must
assume that the sizes of char, short and int are 8, 16, and 32 bits respectively. Assuming that the XPP
has a bit width of 32 we must take precautions for smaller data types.

Therefore we split the stream of data packets with each packet containing 2 or 4 values of the shorter
data type into 2 or 4 streams. If we have enough resources left, this will cause no performance penalty.
Each of the divided streams is sent to its own calculation network; therefore in every cycle two short
or four char values are handled. Nevertheless this causes an area penalty, because besides the
split-merge elements, the whole data flow graph has to be duplicated as often as needed. Figure 64
shows how short values are handled. The packet is split into its hi- and lo part by shift operations and
merged behind the calculation branches. The legality of this transformation is the same as with loop
unrolling with an unrolling factor as big as the data type is smaller than the architecture data type.

Unfortunately this is not the end of the story. The compiler further has to assure that every
intermediate result which produces an over/under-flow for the shorter data type does the same with
the bigger data type. Therefore it has to insert clipping operations which assure that the network
calculates with real 16 or 8 bit values, respectively.
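
The split, per-stream calculation with clipping, and merge can be sketched in C as follows. This is an illustration under stated assumptions, not the NML network itself: op() is a hypothetical per-element operation, and the helper names sext16, clip16 and process_packet are introduced only for this example. One 32-bit packet carries two short values; each half is processed in its own branch and clipped to the 16-bit range before the halves are merged again.

```c
#include <stdint.h>

/* Sign-extend the low 16 bits of h (h must be in 0..0xffff). */
static int32_t sext16(uint32_t h) {
    return (int32_t)(h ^ 0x8000u) - 0x8000;
}

/* Clip an intermediate result to the 16-bit range. */
static int32_t clip16(int32_t v) {
    if (v > 32767) return 32767;
    if (v < -32768) return -32768;
    return v;
}

/* Hypothetical per-element operation of one calculation network. */
static int32_t op(int32_t v) { return clip16(v * 3); }

/* Split one packet into hi and lo part, process both, merge again. */
uint32_t process_packet(uint32_t packet) {
    int32_t lo = sext16(packet & 0xffffu);   /* lo part via mask     */
    int32_t hi = sext16(packet >> 16);       /* hi part via shift    */

    lo = op(lo);                             /* first network        */
    hi = op(hi);                             /* duplicated network   */

    return (((uint32_t)hi & 0xffffu) << 16) | ((uint32_t)lo & 0xffffu);
}
```

For example, a packet holding hi = 2 and lo = -1 (0x0002ffff) yields hi = 6 and lo = -3 (0x0006fffd), and a lo value of 20000 is clipped to 32767 instead of wrapping.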








Figure 64 Splitting short values into two streams and merging them after the calculation. This method
causes no performance penalty.

^ We assume that the size of int is chosen to be the XPP architecture data bit width. Everything else
would not lead to any feasible result.

If the configuration size does not allow the whole loop body to be duplicated or dependencies prevent
this, we still have the possibility to merge the split values again. This of course causes a performance
penalty compared to the previous solution, because the throughput is only one (short) value/cycle now.
Figure 65 shows how the merge is done. Instead of streaming in parallel through two networks the
values are serialized and de-serialized again after the network.

Figure 65 Merging the split values before the network. An event generator drives the merge and demux
PAEs. This figure replaces the 2 black boxes labeled "network" in Figure 64.



5.5.2 Inverse Discrete Cosine Transformation (idct.c)

The idct-algorithm is used for the MPEG2 video decompression algorithm. It operates on 8x8 blocks
of video images in their frequency representation and transforms them back into their original signal
form. The MPEG2 decoder contains a transform-function that calls idct for all blocks of a frequency-
transformed picture to restore the original image.

The idct function consists of two for-loops. The first loop calls idctrow - the second idctcol. Function
inlining is able to eliminate the function calls within the entire loop nest structure so that the numeric
code is not interrupted by function calls anymore. Another way to get rid of function calls between the
loop nest is loop embedding that pushes loops from the caller into the callee.

Original Code (idct.c)

/* two dimensional inverse discrete cosine transform */
void idct(block)
short *block;
{
    int i;

    for (i=0; i<8; i++)
        idctrow(block+8*i);

    for (i=0; i<8; i++)
        idctcol(block+i);
}

The first loop changes the values of the block row by row. Afterwards the changed block is further
transformed column by column. All rows have to be finished before any column processing can be
started.

[Figure: 8 x idctrow, then 8 x idctcol, then result]

Dependency analysis detects true data dependencies between row processing and column processing.
Therefore the processing of the columns has to be delayed until all rows are done. The innermost loop
bodies idctrow and idctcol are nearly identical. They process numeric calculations on eight input
values (column values in case of idctcol and row values in case of idctrow). Eight output values are
calculated and written back (as column/row). Idctcol additionally applies clipping before the values are
written back. This is why we concentrate on idctcol:



/* column (vertical) IDCT
 *
 *             7                         pi         1
 * dst[8*k] = sum c[l] * src[8*l] * cos( -- * ( k + - ) * l )
 *            l=0                        8          2
 *
 * where: c[0]    = 1/1024
 *        c[1..7] = (1/1024)*sqrt(2)
 */
static void idctcol(blk)
short *blk;
{
    int x0, x1, x2, x3, x4, x5, x6, x7, x8;

    /* shortcut */
    if (!((x1 = (blk[8*4]<<8)) | (x2 = blk[8*6]) |
          (x3 = blk[8*2]) | (x4 = blk[8*1]) | (x5 = blk[8*7]) |
          (x6 = blk[8*5]) | (x7 = blk[8*3])))
    {
        blk[8*0]=blk[8*1]=blk[8*2]=blk[8*3]=blk[8*4]=blk[8*5]=
            blk[8*6]=blk[8*7]=iclp[(blk[8*0]+32)>>6];
        return;
    }

    x0 = (blk[8*0]<<8) + 8192;

    /* first stage */
    x8 = W7*(x4+x5) + 4;
    x4 = (x8+(W1-W7)*x4)>>3;
    x5 = (x8-(W1+W7)*x5)>>3;
    x8 = W3*(x6+x7) + 4;
    x6 = (x8-(W3-W5)*x6)>>3;
    x7 = (x8-(W3+W5)*x7)>>3;

    /* second stage */
    x8 = x0 + x1;
    x0 -= x1;
    x1 = W6*(x3+x2) + 4;
    x2 = (x1-(W2+W6)*x2)>>3;
    x3 = (x1+(W2-W6)*x3)>>3;
    x1 = x4 + x6;
    x4 -= x6;
    x6 = x5 + x7;
    x5 -= x7;

    /* third stage */
    x7 = x8 + x3;
    x8 -= x3;
    x3 = x0 + x2;
    x0 -= x2;
    x2 = (181*(x4+x5)+128)>>8;
    x4 = (181*(x4-x5)+128)>>8;

    /* fourth stage */
    blk[8*0] = iclp[(x7+x1)>>14];
    blk[8*1] = iclp[(x3+x2)>>14];
    blk[8*2] = iclp[(x0+x4)>>14];
    blk[8*3] = iclp[(x8+x6)>>14];
    blk[8*4] = iclp[(x8-x6)>>14];
    blk[8*5] = iclp[(x0-x4)>>14];
    blk[8*6] = iclp[(x3-x2)>>14];
    blk[8*7] = iclp[(x7-x1)>>14];
}

W1 - W7 are macros for numeric constants that are substituted by the preprocessor. The iclp array is
used for clipping the results to 8-bit values. It is fully defined by the init_idct function before idct is
called the first time:

void init_idct()
{
    int i;

    iclp = iclip+512;
    for (i= -512; i<512; i++)
        iclp[i] = (i<-256) ? -256 : ((i>255) ? 255 : i);
}

A special kind of idiom recognition (function recognition) is able to replace the calculation of each
array element by a compiler known function that can be realized efficiently on the XPP. If the compiler
features whole program memory aliasing analysis it is able to replace all uses of the iclp array with the
call of the compiler known function. Alternatively a developer can replace the iclp array accesses
manually by the compiler known saturation function calls. The illustration shows a possible
implementation for saturate(val,n) as NML schematic using two ALUs. In this case it is necessary to
replace array accesses like iclp[i] by saturate(i,256).

[Figure: saturate(val,n)]
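
To make the replacement concrete, the sketch below compares the lookup table built by init_idct with a function form of the saturation. The exact semantics of the compiler known function are not given in the text; the sketch assumes that saturate(val, n) clamps val to the range [-n, n-1], which is what the table requires for n = 256. The declarations of iclip and iclp and the helper saturate_matches_table are assumptions made for this example.

```c
/* Assumed semantics: clamp val to the range [-n, n-1]. */
int saturate(int val, int n) {
    if (val < -n) return -n;
    if (val > n - 1) return n - 1;
    return val;
}

/* Clipping table as built by init_idct (declarations assumed). */
static short iclip[1024];
static short *iclp;

void init_idct(void) {
    iclp = iclip + 512;
    for (int i = -512; i < 512; i++)
        iclp[i] = (short)((i < -256) ? -256 : ((i > 255) ? 255 : i));
}

/* Returns 1 if saturate(i, 256) equals iclp[i] for every table index. */
int saturate_matches_table(void) {
    init_idct();
    for (int i = -512; i < 512; i++)
        if (iclp[i] != saturate(i, 256))
            return 0;
    return 1;
}
```

Under this assumption every access iclp[i] with i in [-512, 511] can be replaced by saturate(i, 256) without changing the result, and the table with its memory accesses disappears from the loop body.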

The /* shortcut */ code in idctcol speeds column processing up if x1 to x7 are zero. This breaks the
well-formed structure of the loop nest. The if-condition is not loop invariant and loop unswitching
cannot be applied. Nonetheless, the code after the shortcut handling is well suited for the XPP. It is
possible to synthesize if-conditions for the XPP (speculative processing of both blocks plus selection
based on condition) but this would just waste PAEs without any performance benefit. Therefore the
/* shortcut */ code in idctrow and idctcol has to be removed manually. The code snippet below
shows the inlined version of the idctrow-loop with additional cache instructions for XPP control:

void idct(block)
short *block;
{
    int i;

    XPPPreload(IDCTROW_CONFIG);  // Loop Invariant

    for (i=0; i<8; i++) {
        short *blk;
        int x0, x1, x2, x3, x4, x5, x6, x7, x8;
        blk = block+8*i;

        XPPPreload(0, blk, 8);
        XPPPreloadClean(1, blk, 8);  // IRAM1 is erased and assigned to blk
        XPPExecute(IDCTROW_CONFIG, IRAM(0), IRAM(1));
    }

    for (i=0; i<8; i++) {

    }
}

As the configuration of the XPP does not change during the loop execution, invariant code motion has
moved XPPPreload(IDCTROW_CONFIG) out of the loop.

NML Code Generation

Data Flow Graph 

As idctcol is more complex due to clipping at the end of the calculations we decided to take idctcol as
representative loop body for a presentation of the data flow graph.

The figure on the next page shows the data flow graph for the IDCTCOLUMN_CONFIG. A heuristic
has to be applied to the graph to estimate the resource needs on the XPP. In our example the heuristic
produces the following results:

Fortunately the data flow graph fits into an XPP64 and we can proceed without loop dissevering^
(splitting the loop body into suitable chunks) for this example.

^ XPP-VC: A C Compiler with Temporal Partitioning for the PACT-XPP Architecture, J. M. P. Cardoso and
Markus Weinhardt

Address Generation:

To fully synthesize the loop body we have to face the problem of address generation for accessing the
data.

For IDCTCOLUMN_CONFIG we have to select the n-th element of every row which means an
address series of (0,8,16,...,1,9,17,...,7,15,23,...). We use two counter macros for address
generation as shown opposite. The upper counter increments by eight and the lower by one. The IRAM
output is passed to the data flow graph of IDCTCOLUMN. If all (eight) row elements of a column are
available SWAP is switched through to the data flow graph input and the calculation for a new column
begins.

For the IDCTROW_CONFIG the address generation is very simple as the IRAM already contains the
block in the appropriate order (row after row as it has to be accessed). Again by using SIUP (stepped
iterative up)-counter macros as described in the XPP tutorial it is possible to map linear address
expressions to NML-code in a generic way. As IDCTROW_CONFIG accesses a two-dimensional
array we need two SIUP-counters in the corresponding NML code. The column-elements have to be
accessed row after row so the upper counter's increment is one and the lower counter's increment is
eight. However, the NML code for this access pattern (0,...,5,6,7,8,9,...,63) can be reduced to one
single counter (or to FIFO-mode IRAM access).
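
The two address series can be reproduced with a pair of nested counters. The following C sketch only models the resulting address sequences; it is not NML code, and the function names column_addresses and row_addresses are hypothetical. One counter steps by eight, the other by one, and their sum is the IRAM address.

```c
/* Address series for column processing: 0,8,16,...,56,1,9,17,...,63. */
void column_addresses(int addr[64]) {
    int n = 0;
    for (int col = 0; col < 8; col++)           /* counter stepping by one   */
        for (int row = 0; row < 64; row += 8)   /* counter stepping by eight */
            addr[n++] = row + col;
}

/* Address series for row processing: 0,1,2,...,63. */
void row_addresses(int addr[64]) {
    int n = 0;
    for (int row = 0; row < 64; row += 8)       /* counter stepping by eight */
        for (int col = 0; col < 8; col++)       /* counter stepping by one   */
            addr[n++] = row + col;
}
```

The row series is simply 0 to 63 in order, which is why it can be collapsed into a single counter or FIFO-mode access, as stated above.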

Address generation for write access is implemented in the same manner. The resources have to be
updated to take this additional code into account. It takes 2*(8+8+2*1) FREGs and 2*(2+1) more
BREGs in the worst case, which are still available on the XPP.

If IRAM use is not critical it is also possible to distribute the data on several IRAMs to improve the
memory throughput into the XPP-array. This task has to be done by the RISC-core or by a more
sophisticated XPP-cache controller.




Further Enhancing XPP Utilization 

As mentioned at the beginning idct is called for all data blocks of a video image (loop in transform.c).
This circumstance allows us to further improve the XPP utilization.

When we look at the data flow graph of idctcol in detail we see that it forms a very deep pipeline.
Recall that the IDCTROW_CONFIG runs only eight times on the XPP, which means that only 64
elements (8 times 8 elements of a column) are processed through this pipeline, and that we then have
to wait until all data has left the pipeline before we can change the XPP configuration to the
IDCTCOLUMN_CONFIG configuration to go on with column processing. It becomes obvious that
something is suboptimal in our example.

Problem (Pipeline Depth)

The pipeline is just too deep for processing only eight times eight rows. Filling and flushing a deep
pipeline is expensive if only little data is processed with it. First the units at the end of the pipeline are
idle and then the units at the beginning are unused.

Solution (Loop Tiling)

It is profitable to use loop interchange for moving the dependencies between row and column
processing to an outer level of the loop nest. The loop that calls the idct-function (in transform.c) on
several blocks of the image has no loop interchange preventing dependencies. Therefore this loop can
be moved inside the loops of column and row processing.

// transform.c

for (n=0; n<block_count; n++) {
    idct(blocks[k*block_count+n]);  // block_count is 6 or 8 or 12
}

// idct.c

/* two dimensional inverse discrete cosine transform */
void idct(block)
short *block;
{
    int i;

    for (i=0; i<8; i++)
        idctrow(block+8*i);

    for (i=0; i<8; i++)
        idctcol(block+i);
}



Now the processing of rows and columns can be applied on more data (by applying loop tiling) and
therefore filling and flushing the pipeline can be neglected.
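
The effect of the interchange can be sketched in plain C. The kernels rowop and colop below are hypothetical stand-ins for idctrow and idctcol (simple prefix sums, chosen only so that the column pass truly depends on the row pass); the point of the sketch is the loop order, not the arithmetic. After the interchange all rows of all blocks are processed first, then all columns, and the result is the same as with the per-block order.

```c
/* Hypothetical stand-ins for idctrow and idctcol. */
static void rowop(short *row) {
    for (int i = 1; i < 8; i++)
        row[i] = (short)(row[i] + row[i - 1]);
}
static void colop(short *col) {
    for (int i = 1; i < 8; i++)
        col[8 * i] = (short)(col[8 * i] + col[8 * (i - 1)]);
}

/* Original order: rows then columns, block by block. */
void transform_per_block(short blocks[][64], int block_count) {
    for (int n = 0; n < block_count; n++) {
        for (int i = 0; i < 8; i++) rowop(blocks[n] + 8 * i);
        for (int i = 0; i < 8; i++) colop(blocks[n] + i);
    }
}

/* After interchange: the block loop sits inside row and column loops. */
void transform_tiled(short blocks[][64], int block_count) {
    for (int i = 0; i < 8; i++)
        for (int n = 0; n < block_count; n++)
            rowop(blocks[n] + 8 * i);
    for (int i = 0; i < 8; i++)
        for (int n = 0; n < block_count; n++)
            colop(blocks[n] + i);
}

/* Returns 1 if both loop orders give identical results for 6 blocks. */
int tiled_matches_per_block(void) {
    static short a[6][64], b[6][64];
    for (int n = 0; n < 6; n++)
        for (int i = 0; i < 64; i++)
            a[n][i] = b[n][i] = (short)((n * 7 + i) % 10);
    transform_per_block(a, 6);
    transform_tiled(b, 6);
    for (int n = 0; n < 6; n++)
        for (int i = 0; i < 64; i++)
            if (a[n][i] != b[n][i])
                return 0;
    return 1;
}
```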

Constraints (Cache Sensitive Loop Tiling)

The caching hierarchy has to be taken into account when we define the number of blocks that will be
processed by the IDCTROW_CONFIG. Remember, we need the same blocks in the subsequent
IDCTCOLUMN_CONFIG configuration! We have to take care that all blocks that are processed
during IDCTROW_CONFIG fit into the cache. Loop tiling has to be applied with respect to the cache
size so that the processed data fits into the cache.

IRAM reuse between different configurations

This example implies another bandwidth optimization that is just a more consistent version of loop
tiling. Instead of transferring data from row processing to column processing via the memory hierarchy
(cache sensitive loop tiling takes care that only the cache memory is accessed) we can completely
bypass the memory interface by using the output IRAM of Config A as input IRAM of Config B.

Putting all together

If we apply cache sensitive loop tiling, IRAM reuse and function inlining we can further optimize our
example:

Finally the idct-function gets completely inlined in transform.c. If block_count is e.g. 6 and we assume
that 64*6 words do not exceed the cache size then we can transform the example to:

// transform.c

block = blocks[k*6];
XPPPreload(IDCTROW_CONFIG);
XPPPreload(0, block, 64*6);       // IRAM0 gets 64 words from 6 blocks
XPPPreloadClean(1, block, 64*6);  // erase IRAM1 and assign to the 6 blocks
XPPExecute(IDCTROW_CONFIG, IRAM(0), IRAM(1));

XPPPreload(IDCOLUMN_CONFIG);
XPPPreload(1, block, 64*6);       // redundant -> will be eliminated
XPPExecute(IDCOLUMN_CONFIG, IRAM(1), IRAM(2));



The address generation in IDCTROW_CONFIG and IDCOLUMN_CONFIG has to be modified to
reflect the different data block size, caused by loop tiling, that has to be processed. This can be
implemented by an additional SIUP counter that generates the block offsets inside the tiles.

[Figure: additional counter generating the block offset, block_count = 6]

The table contains architectural parameters for IDCTROW_CONFIG and IDCOLUMN_CONFIG of
the final result. It relies on a cache that is able to store block_count blocks. As two configurations are
executed in this example the configuration cycles have to be taken twice and therefore the total
configuration cycles are 2 x (block_count x 64 + (12 + 2 x 8) x 2).
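
As a worked example of this formula, the small sketch below evaluates it for a given block_count. The function names are hypothetical; the constants are taken as they appear in the formula above.

```c
/* Cycles of one configuration: block_count x 64 + (12 + 2 x 8) x 2. */
int config_cycles(int block_count) {
    return block_count * 64 + (12 + 2 * 8) * 2;
}

/* Both configurations (row and column processing) are executed once. */
int total_config_cycles(int block_count) {
    return 2 * config_cycles(block_count);
}
```

For block_count = 6 one configuration takes 6 x 64 + 56 = 440 cycles, so the total is 880 cycles.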



Parameter                   Value
Vector length               8 words
Reused data set size        block_count x 64 words
I/O IRAMs                   3 (one shared)
ALU                         45 FUs
BREG                        41 FUs
FREG                        36 FUs
Data flow graph width       8
Data flow graph height      12
Configuration cycles        block_count x 64 + (12 + 2*8) x 2



Performance Considerations

In this example it is possible to exploit high data locality which means that many operations are
performed on a limited memory range. The performance of the proposed XPP solution is compared to
a hypothetical superscalar RISC-architecture. We assume an issue width of two which means that the
RISC executes on average two operations in parallel.

Operation       Ops for Row/Column    Cycles/Op    Est. RISC cycles
LD/ST                  16                 2               32
ADRCOMP                16                 1               16
ADD/SUB                35                 1               35
MULT                   11                 2               22
SHIFT                  18                 1               18
SAT                     8                 4               32
Issue Width                                                2
Cyc/Row(Col)                                            77.5
Proc. Rows              8                                620
Proc. Cols              8                                620
RISC Cyc/Blk                                            1240
XPP Cyc/Blk           128    with data duplication + reordering: 24
Speedup                10    with data duplication + reordering: 52



Even though the speedup is reasonable, it becomes obvious that fetching the input data from a single
IRAM (which means that we have to feed the eight inputs in eight cycles before processing is started)
reduces the potential speedup significantly. In other words we have a pipeline that is able to process
eight input values per cycle but we are loading the pipeline only every eighth cycle. As a result only
every eighth pipeline stage is filled. The figure below illustrates this:

[Figure: pipeline utilization without data duplication and with data duplication]

Full utilization can be achieved only by loading the eight input values of the pipeline in one cycle. A
simple solution to improve the memory throughput to the pipeline is data duplication as described in
the hardware section.

Instead of loading the six 8x8 blocks to a single IRAM we use the XPPPreloadMultiple command to
load the eight IRAMs with the same contents:

XPPPreload(0, block, 64*6);  // IRAM0 gets 64 words from 6 blocks

is changed to:

XPPPreloadMultiple(0xFF, block, 64*6)  // load IRAM0 up to IRAM7 with blocks

Now the pipeline gets fully utilized and we also have to store eight results per cycle. This can be 
achieved by writing eveiy^outputrvalue to another IRAM which additionally takes eight more IRAMs 
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(using data duplication in this example needs all 16 IRAMs of the XPP64). For storing the data tliat is 
generated with Ii:>CTROW_CONFIG we have to change: 

XPPPreloadCl^an(l,blockr 64*6); // erase IRAMl and assign to the 6 blocks 
to: 

tmpsize = 64*6/8; 
XPPPreloadCl^an ( 8, 
XPPPreloadClean ( 9 , 
XPPPreloadCle^ ( 10 , 
XPPPreloadClo^n (11, 
XPPPreloadCIe^ (12 , 
XPPPreloadCle^ ( 13 , 
XPPPreloadCle^n ( 14 , 
XPPPreloadClean ( 15 , 

This causes different data layouts for the intermediate results. We need an additional configuration
(REORDER_CONFIG) to restore the original data layout.
Again, address generation has to be modified to fetch eight input values per cycle. On the one hand this
requires seven additional adders, but on the other hand it avoids swaps and latches for keeping the data
for eight cycles.
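The way the intermediate results are split over IRAM8 to IRAM15 can be written down as plain arithmetic. The following sketch only computes the sizes and offsets that appear as tmpsize and block+n*tmpsize in the text; the function names are hypothetical and no XPP call is involved.

```c
#include <assert.h>

#define NUM_RESULT_IRAMS 8   /* IRAM8 .. IRAM15, as in the text */

/* Words each result IRAM receives when `total_words` results are spread
 * evenly over the eight result IRAMs (tmpsize in the text). */
static int result_words_per_iram(int total_words)
{
    assert(total_words > 0 && total_words % NUM_RESULT_IRAMS == 0);
    return total_words / NUM_RESULT_IRAMS;
}

/* Offset in words, relative to `block`, of the region assigned to result
 * IRAM number `iram` (8..15). */
static int result_offset(int iram, int total_words)
{
    assert(iram >= 8 && iram < 8 + NUM_RESULT_IRAMS);
    return (iram - 8) * result_words_per_iram(total_words);
}
```

For the six 8x8 blocks (64*6 = 384 words) each result IRAM holds 48 words, so consecutive result values end up in different IRAMs; this is the layout change that REORDER_CONFIG has to undo.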

Data duplication and data reordering finally transform the example code to:

// transform.c

block = blocks[k*6];
XPPPreload(IDCTROW_CONFIG);
XPPPreloadMultiple(0xFF, block, 64*6);          // load IRAM0 up to IRAM7 with blocks
tmpsize = 64*6/8;                               // result gets separated into 8 IRAMs
XPPPreloadClean( 8, block+0*tmpsize, tmpsize);  // IRAM8 for interm. Rslt 1
XPPPreloadClean( 9, block+1*tmpsize, tmpsize);  // IRAM9 for interm. Rslt 1
XPPPreloadClean(10, block+2*tmpsize, tmpsize);  // IRAM10 for interm. Rslt 1
XPPPreloadClean(11, block+3*tmpsize, tmpsize);  // IRAM11 for interm. Rslt 1
XPPPreloadClean(12, block+4*tmpsize, tmpsize);  // IRAM12 for interm. Rslt 1
XPPPreloadClean(13, block+5*tmpsize, tmpsize);  // IRAM13 for interm. Rslt 1
XPPPreloadClean(14, block+6*tmpsize, tmpsize);  // IRAM14 for interm. Rslt 1
XPPPreloadClean(15, block+7*tmpsize, tmpsize);  // IRAM15 for interm. Rslt 1
XPPExecute(IDCTROW_CONFIG, IRAM(0-7), IRAM(8-15));



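XPPPreload(IDCOLUMN_CONFIG);
XPPPreloadMultiple(0xFF, block, 64*6);          // ld IRAM0-IRAM7 with interm. Rslt 1
XPPPreloadClean( 8, block+0*tmpsize, tmpsize);  // IRAM8 for interm. Rslt 2
XPPPreloadClean( 9, block+1*tmpsize, tmpsize);  // IRAM9 for interm. Rslt 2
XPPPreloadClean(10, block+2*tmpsize, tmpsize);  // IRAM10 for interm. Rslt 2
XPPPreloadClean(11, block+3*tmpsize, tmpsize);  // IRAM11 for interm. Rslt 2
XPPPreloadClean(12, block+4*tmpsize, tmpsize);  // IRAM12 for interm. Rslt 2
XPPPreloadClean(13, block+5*tmpsize, tmpsize);  // IRAM13 for interm. Rslt 2
XPPPreloadClean(14, block+6*tmpsize, tmpsize);  // IRAM14 for interm. Rslt 2
XPPPreloadClean(15, block+7*tmpsize, tmpsize);  // IRAM15 for interm. Rslt 2
XPPExecute(IDCOLUMN_CONFIG, IRAM(0-7), IRAM(8-15));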

XPPPreload(REORDER_CONFIG);
XPPPreloadMultiple(0xFF, block, 64*6);            // ld IRAM0-IRAM7 with interm. Rslt 2
rsltsize = 64;                                    // 64*6/6
XPPPreloadClean( 8, block+0*rsltsize, rsltsize);  // IRAM8 for final Rslt
XPPPreloadClean( 9, block+1*rsltsize, rsltsize);  // IRAM9 for final Rslt
XPPPreloadClean(10, block+2*rsltsize, rsltsize);  // IRAM10 for final Rslt
XPPPreloadClean(11, block+3*rsltsize, rsltsize);  // IRAM11 for final Rslt
XPPPreloadClean(12, block+4*rsltsize, rsltsize);  // IRAM12 for final Rslt
XPPPreloadClean(13, block+5*rsltsize, rsltsize);  // IRAM13 for final Rslt
XPPExecute(REORDER_CONFIG, IRAM(0-7), IRAM(8-13));
5.6 Wavelet



5.6.1 Original Code 



void forward_wavelet()
{
    int i, nt, *dmid;
    int *sp, *dp, d_tmp0, d_tmp1, d_tmpi, s_tmp0, s_tmp1;
    int mid, ii;
    int *x;
    int s[256], d[256];

    for (nt = COL; nt >= BLOCK_SIZE; nt >>= 1) {
        for (i = 0; i < nt*COL /* tmp_nt */; i += COL) {
            x = &int_data[i];
            mid = (nt >> 1) - 1;
            s[0] = x[0];
            d[0] = x[ROW];
            s[1] = x[2];
            s[mid] = x[2*mid];
            d[mid] = x[2*mid+ROW];

            d[0] = (d[0] << 1) - s[0] - s[1];
            s[0] = s[0] + (d[0] >> 2);

            d_tmp0 = d[0];
            s_tmp0 = s[1];

            for (ii = 1; ii < mid; ii++) {
                s_tmp1 = x[2*ii+2];
                d_tmp1 = ((x[2*ii+ROW]) << 1) - s_tmp0 - s_tmp1;
                d[ii] = d_tmp1;
                s[ii] = s_tmp0 + ((d_tmp0 + d_tmp1) >> 3);
                d_tmp0 = d_tmp1;
                s_tmp0 = s_tmp1;
            }

            d[mid] = (d[mid] - s[mid]) << 1;
            s[mid] = s[mid] + ((d[mid-1] + d[mid]) >> 3);

            for (ii = 0; ii <= mid; ii++) {
                x[ii] = s[ii];
                x[ii+mid+1] = d[ii];
            }
        }

        for (i = 0; i < nt; i++) {
            x = &int_data[i];
            mid = (nt >> 1) - 1;
            s[0] = x[0];
            d[0] = x[COL];
            s[1] = x[COL<<1];
            s[mid] = x[(COL<<1)*mid];
            d[mid] = x[(COL<<1)*mid + COL];

            d[0] = (d[0] << 1) - s[0] - s[1];
            s[0] = s[0] + (d[0] >> 2);

            d_tmp0 = d[0];
            s_tmp0 = s[1];

            for (ii = 1; ii < mid; ii++) {
                s_tmp1 = x[2*COL*(ii+1)];
                d_tmp1 = (x[2*COL*ii+COL] << 1) - s_tmp0 - s_tmp1;
                d[ii] = d_tmp1;
                s[ii] = s_tmp0 + ((d_tmp0 + d_tmp1) >> 3);
                d_tmp0 = d_tmp1;
                s_tmp0 = s_tmp1;
            }

            d[mid] = (d[mid] << 1) - (s[mid] << 1);
            s[mid] = s[mid] + ((d[mid-1] + d[mid]) >> 3);

            for (ii = 0; ii <= mid; ii++) {
                x[ii*COL] = s[ii];
                x[(ii+mid+1)*COL] = d[ii];
            }
        }
    }
}
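To see what a single row pass of this listing computes, the row loop body can be isolated into a self-contained function. This is a sketch under assumptions: ROW is taken as 1, the function name wavelet_row_pass and the constant MAX_HALF are hypothetical, and signed shifts are assumed to behave arithmetically, as the listing itself assumes.

```c
#include <assert.h>

#define MAX_HALF 256   /* size of the s[] and d[] scratch arrays, as in the listing */

/* One in-place row pass of the lifting scheme from the listing, with ROW = 1.
 * `x` points to `nt` samples; on return the first nt/2 entries hold the
 * s (smooth) values and the last nt/2 entries hold the d (detail) values. */
static void wavelet_row_pass(int *x, int nt)
{
    int s[MAX_HALF], d[MAX_HALF];
    int mid = (nt >> 1) - 1;
    int ii, d_tmp0, d_tmp1, s_tmp0, s_tmp1;

    assert(x != 0 && nt >= 4 && (nt & 1) == 0 && nt <= 2 * MAX_HALF);

    /* boundary samples are read before anything is written back */
    s[0] = x[0];
    d[0] = x[1];
    s[1] = x[2];
    s[mid] = x[2 * mid];
    d[mid] = x[2 * mid + 1];

    d[0] = (d[0] << 1) - s[0] - s[1];
    s[0] = s[0] + (d[0] >> 2);

    d_tmp0 = d[0];
    s_tmp0 = s[1];
    for (ii = 1; ii < mid; ii++) {
        s_tmp1 = x[2 * ii + 2];
        d_tmp1 = (x[2 * ii + 1] << 1) - s_tmp0 - s_tmp1;
        d[ii] = d_tmp1;
        s[ii] = s_tmp0 + ((d_tmp0 + d_tmp1) >> 3);
        d_tmp0 = d_tmp1;
        s_tmp0 = s_tmp1;
    }

    d[mid] = (d[mid] - s[mid]) << 1;
    s[mid] = s[mid] + ((d[mid - 1] + d[mid]) >> 3);

    /* write the smooth half followed by the detail half */
    for (ii = 0; ii <= mid; ii++) {
        x[ii] = s[ii];
        x[ii + mid + 1] = d[ii];
    }
}
```

A constant input row is a convenient sanity check: every detail value becomes zero and the smooth half keeps the constant.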



5.6.2 Optimizing the Whole Loop Nest

After pre-processing and application of copy propagation over s_tmp1 and d_tmp1, we obtain the
following loop nest.

void forward_wavelet()
{
    int i, nt, *dmid;
    int *sp, *dp, d_tmp0, d_tmp1, d_tmpi, s_tmp0, s_tmp1;
    int mid, ii;
    int *x;
    int s[256], d[256];

    for (nt = 64; nt >= 16; nt >>= 1) {
        for (i = 0; i < nt*64; i += 64) {
            x = &int_data[i];
            mid = (nt >> 1) - 1;
            s[0] = x[0];
            d[0] = x[1];
            s[1] = x[2];
            s[mid] = x[2*mid];
            d[mid] = x[2*mid+1];

            d[0] = (d[0] << 1) - s[0] - s[1];
            s[0] = s[0] + (d[0] >> 2);

            d_tmp0 = d[0];
            s_tmp0 = s[1];

            for (ii = 1; ii < mid; ii++) {
                d[ii] = ((x[2*ii+1]) << 1) - s_tmp0 - x[2*ii+2];
                s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
                d_tmp0 = d[ii];
                s_tmp0 = s[ii];
            }

            d[mid] = (d[mid] - s[mid]) << 1;
            s[mid] = s[mid] + ((d[mid-1] + d[mid]) >> 3);

            for (ii = 0; ii <= mid; ii++) {
                x[ii] = s[ii];
                x[ii+mid+1] = d[ii];
            }
        }

        for (i = 0; i < nt; i++) {
            x = &int_data[i];
            mid = (nt >> 1) - 1;
            s[0] = x[0];
            d[0] = x[64];
            s[1] = x[128];
            s[mid] = x[128*mid];
            d[mid] = x[128*mid + 64];

            d[0] = (d[0] << 1) - s[0] - s[1];
            s[0] = s[0] + (d[0] >> 2);

            d_tmp0 = d[0];
            s_tmp0 = s[1];

            for (ii = 1; ii < mid; ii++) {
                d[ii] = (x[128*ii+64] << 1) - s_tmp0 - x[128*(ii+1)];
                s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
                d_tmp0 = d[ii];
                s_tmp0 = s[ii];
            }

            d[mid] = (d[mid] << 1) - (s[mid] << 1);
            s[mid] = s[mid] + ((d[mid-1] + d[mid]) >> 3);

            for (ii = 0; ii <= mid; ii++) {
                x[ii*64] = s[ii];
                x[(ii+mid+1)*64] = d[ii];
            }
        }
    }
}

Then we have 4 tables, one for each innermost loop. The tables for the first and the third loops are
identical, as are the tables for the second and the fourth loop. We have the following two tables.

Parameter                 Value
Vector length             mid-1
Reused data set size
I/O IRAMs
ALU
BREG                      0
FREG
Data flow graph width     2
Data flow graph height
Configuration cycles      6+(mid-2)

Parameter                 Value
Vector length             mid
Reused data set size
I/O IRAMs
ALU
BREG                      0
FREG                      0
Data flow graph width
Data flow graph height    1
Configuration cycles      mid



The two inner loops do not have the same iteration range but could be candidates for loop fusion;
therefore the first and last iterations of the second loop are peeled off. The surrounding code between the 2
loops can be moved after the second loop, and we obtain the following code for the loop nest.

for (nt = 64; nt >= 16; nt >>= 1) {
    for (i = 0; i < nt*64; i += 64) {
        x = &int_data[i];
        mid = (nt >> 1) - 1;

        s[0] = x[0];
        d[0] = x[1];
        s[1] = x[2];
        s[mid] = x[2*mid];
        d[mid] = x[2*mid+1];

        d[0] = (d[0] << 1) - s[0] - s[1];
        s[0] = s[0] + (d[0] >> 2);

        d_tmp0 = d[0];
        s_tmp0 = s[1];

        for (ii = 1; ii < mid; ii++) {
            d[ii] = ((x[2*ii+1]) << 1) - s_tmp0 - x[2*ii+2];
            s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
            d_tmp0 = d[ii];
            s_tmp0 = s[ii];
        }

        for (ii = 1; ii < mid; ii++) {
            x[ii] = s[ii];
            x[ii+mid+1] = d[ii];
        }

        d[mid] = (d[mid] - s[mid]) << 1;
        s[mid] = s[mid] + ((d[mid-1] + d[mid]) >> 3);

        x[0] = s[0];
        x[mid+1] = d[0];
        x[mid] = s[mid];
        x[2*mid+1] = d[mid];
    }

    for (i = 0; i < nt; i++) {
        x = &int_data[i];
        mid = (nt >> 1) - 1;

        s[0] = x[0];
        d[0] = x[64];
        s[1] = x[128];
        s[mid] = x[128*mid];
        d[mid] = x[128*mid + 64];

        d[0] = (d[0] << 1) - s[0] - s[1];
        s[0] = s[0] + (d[0] >> 2);

        d_tmp0 = d[0];
        s_tmp0 = s[1];

        for (ii = 1; ii < mid; ii++) {
            d[ii] = (x[128*ii+64] << 1) - s_tmp0 - x[128*(ii+1)];
            s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
            d_tmp0 = d[ii];
            s_tmp0 = s[ii];
        }

        for (ii = 1; ii < mid; ii++) {
            x[ii*64] = s[ii];
            x[(ii+mid+1)*64] = d[ii];
        }

        d[mid] = (d[mid] << 1) - (s[mid] << 1);
        s[mid] = s[mid] + ((d[mid-1] + d[mid]) >> 3);

        x[0] = s[0];
        x[(mid+1)*64] = d[0];
        x[mid*64] = s[mid];
        x[(2*mid+1)*64] = d[mid];
    }
}



After loop peeling, the only change in the parameters is the vector length. The tables become:

Parameter                 Value
Vector length             mid-2
Reused data set size
I/O IRAMs                 6
ALU                       2
BREG                      0
FREG                      2
Data flow graph width     2
Data flow graph height    6
Configuration cycles      6+(mid-2)

Parameter                 Value
Vector length             mid-2
Reused data set size
I/O IRAMs                 6
ALU                       0
BREG                      0
FREG                      0
Data flow graph width     2
Data flow graph height    1
Configuration cycles      mid-2



The fusion of the inner loops is legal as there would be no loop-carried dependences between the
instructions formerly in the second loop and the instructions formerly in the first loop. We obtain the
following loop nest.

for (nt = 64; nt >= 16; nt >>= 1) {
    for (i = 0; i < nt*64 /* tmp_nt */; i += 64) {
        x = &int_data[i];
        mid = (nt >> 1) - 1;

        s[0] = x[0];
        d[0] = x[1];
        s[1] = x[2];
        s[mid] = x[2*mid];
        d[mid] = x[2*mid+1];

        d[0] = (d[0] << 1) - s[0] - s[1];
        s[0] = s[0] + (d[0] >> 2);

        d_tmp0 = d[0];
        s_tmp0 = s[1];

        for (ii = 1; ii < mid; ii++) {
            d[ii] = ((x[2*ii+1]) << 1) - s_tmp0 - x[2*ii+2];
            s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
            d_tmp0 = d[ii];
            s_tmp0 = s[ii];
            x[ii] = s[ii];
            x[ii+mid+1] = d[ii];
        }

        d[mid] = (d[mid] - s[mid]) << 1;
        s[mid] = s[mid] + ((d[mid-1] + d[mid]) >> 3);

        x[0] = s[0];
        x[mid+1] = d[0];
        x[mid] = s[mid];
        x[2*mid+1] = d[mid];
    }

    for (i = 0; i < nt; i++) {
        x = &int_data[i];
        mid = (nt >> 1) - 1;

        s[0] = x[0];
        d[0] = x[64];
        s[1] = x[128];
        s[mid] = x[128*mid];
        d[mid] = x[128*mid + 64];

        d[0] = (d[0] << 1) - s[0] - s[1];
        s[0] = s[0] + (d[0] >> 2);

        d_tmp0 = d[0];
        s_tmp0 = s[1];

        for (ii = 1; ii < mid; ii++) {
            d[ii] = (x[128*ii+64] << 1) - s_tmp0 - x[128*(ii+1)];
            s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
            d_tmp0 = d[ii];
            s_tmp0 = s[ii];
            x[ii*64] = s[ii];
            x[(ii+mid+1)*64] = d[ii];
        }

        d[mid] = (d[mid] << 1) - (s[mid] << 1);
        s[mid] = s[mid] + ((d[mid-1] + d[mid]) >> 3);

        x[0] = s[0];
        x[(mid+1)*64] = d[0];
        x[mid*64] = s[mid];
        x[(2*mid+1)*64] = d[mid];
    }
}



After loop fusion, we have only two loops, which have the same parameter table.

Parameter                 Value
Vector length             mid-2
Reused data set size
I/O IRAMs                 8
ALU                       6
BREG                      0
FREG                      2
Data flow graph width     2
Data flow graph height    6
Configuration cycles      6+(mid-2)



When performing value range analysis, the compiler finds that nt takes the values 64, 32 and
16. The upper bound of the inner loops is mid, which depends on the value of nt. The analysis then
finds that mid can take the values 31, 15 and 7. Loops with constant loop bounds can be handled more
efficiently on the PACT XPP. This means that the inner loops can be better optimized if mid is
replaced by a constant value. This happens when the outer loop is unrolled. This way we obtain
larger code, but with 3 instances of the loop nest, each being a candidate for a configuration. This can
be seen as a kind of temporal partitioning. Thus the outer loop is completely unrolled, giving six new
loop nests.
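The values found by the range analysis can be reproduced by simply enumerating the outer loop. The sketch below is only an illustration of that enumeration, not the compiler's analysis; the function name and its parameters are hypothetical.

```c
#include <assert.h>

/* Enumerate the values that mid = (nt >> 1) - 1 takes while nt runs through
 * the outer loop "for (nt = nt_start; nt >= nt_min; nt >>= 1)".
 * Stores the values in mids[] (capacity max) and returns their count. */
static int wavelet_mid_values(int nt_start, int nt_min, int mids[], int max)
{
    int n = 0;
    assert(nt_min > 0 && mids != 0);
    for (int nt = nt_start; nt >= nt_min; nt >>= 1) {
        assert(n < max);
        mids[n++] = (nt >> 1) - 1;
    }
    return n;
}
```

For nt_start = 64 and nt_min = 16 the outer loop has three iterations, which is why complete unrolling yields three instances of the loop nest, each with a row loop and a column loop.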

for (i = 0; i < 4096; i += 64) {   /* nt = 64 */
    x = &int_data[i];
    mid = 31;

    s[0] = x[0];
    d[0] = x[1];
    s[1] = x[2];
    s[31] = x[62];
    d[31] = x[63];

    d[0] = (d[0] << 1) - s[0] - s[1];
    s[0] = s[0] + (d[0] >> 2);

    d_tmp0 = d[0];
    s_tmp0 = s[1];

    for (ii = 1; ii < 31; ii++) {
        d[ii] = ((x[2*ii+1]) << 1) - s_tmp0 - x[2*ii+2];
        s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
        d_tmp0 = d[ii];
        s_tmp0 = s[ii];
        x[ii] = s[ii];
        x[ii+32] = d[ii];
    }

    d[31] = (d[31] - s[31]) << 1;
    s[31] = s[31] + ((d[30] + d[31]) >> 3);

    x[0] = s[0];
    x[32] = d[0];
    x[31] = s[31];
    x[63] = d[31];
}

for (i = 0; i < 64; i++) {
    x = &int_data[i];
    mid = 31;

    s[0] = x[0];
    d[0] = x[64];
    s[1] = x[128];
    s[31] = x[3968];
    d[31] = x[4032];

    d[0] = (d[0] << 1) - s[0] - s[1];
    s[0] = s[0] + (d[0] >> 2);

    d_tmp0 = d[0];
    s_tmp0 = s[1];

    for (ii = 1; ii < 31; ii++) {
        d[ii] = (x[128*ii+64] << 1) - s_tmp0 - x[128*(ii+1)];
        s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
        d_tmp0 = d[ii];
        s_tmp0 = s[ii];
        x[ii*64] = s[ii];
        x[(ii+32)*64] = d[ii];
    }

    d[31] = (d[31] << 1) - (s[31] << 1);
    s[31] = s[31] + ((d[30] + d[31]) >> 3);

    x[0] = s[0];
    x[2048] = d[0];
    x[1984] = s[31];
    x[4032] = d[31];
}

for (i = 0; i < 2048; i += 64) {   /* nt = 32 */
    x = &int_data[i];
    mid = 15;

    s[0] = x[0];
    d[0] = x[1];
    s[1] = x[2];
    s[15] = x[30];
    d[15] = x[31];

    d[0] = (d[0] << 1) - s[0] - s[1];
    s[0] = s[0] + (d[0] >> 2);

    d_tmp0 = d[0];
    s_tmp0 = s[1];

    for (ii = 1; ii < 15; ii++) {
        d[ii] = ((x[2*ii+1]) << 1) - s_tmp0 - x[2*ii+2];
        s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
        d_tmp0 = d[ii];
        s_tmp0 = s[ii];
        x[ii] = s[ii];
        x[ii+16] = d[ii];
    }

    d[15] = (d[15] - s[15]) << 1;
    s[15] = s[15] + ((d[14] + d[15]) >> 3);

    x[0] = s[0];
    x[16] = d[0];
    x[15] = s[15];
    x[31] = d[15];
}

for (i = 0; i < 32; i++) {
    x = &int_data[i];
    mid = 15;

    s[0] = x[0];
    d[0] = x[64];
    s[1] = x[128];
    s[15] = x[1920];
    d[15] = x[1984];

    d[0] = (d[0] << 1) - s[0] - s[1];
    s[0] = s[0] + (d[0] >> 2);

    d_tmp0 = d[0];
    s_tmp0 = s[1];

    for (ii = 1; ii < 15; ii++) {
        d[ii] = (x[128*ii+64] << 1) - s_tmp0 - x[128*(ii+1)];
        s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
        d_tmp0 = d[ii];
        s_tmp0 = s[ii];
        x[ii*64] = s[ii];
        x[(ii+16)*64] = d[ii];
    }

    d[15] = (d[15] << 1) - (s[15] << 1);
    s[15] = s[15] + ((d[14] + d[15]) >> 3);

    x[0] = s[0];
    x[1024] = d[0];
    x[960] = s[15];
    x[1984] = d[15];
}

for (i = 0; i < 1024; i += 64) {   /* nt = 16 */
    x = &int_data[i];
    mid = 7;

    s[0] = x[0];
    d[0] = x[1];
    s[1] = x[2];
    s[7] = x[14];
    d[7] = x[15];

    d[0] = (d[0] << 1) - s[0] - s[1];
    s[0] = s[0] + (d[0] >> 2);

    d_tmp0 = d[0];
    s_tmp0 = s[1];

    for (ii = 1; ii < 7; ii++) {
        d[ii] = ((x[2*ii+1]) << 1) - s_tmp0 - x[2*ii+2];
        s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
        d_tmp0 = d[ii];
        s_tmp0 = s[ii];
        x[ii] = s[ii];
        x[ii+8] = d[ii];
    }

    d[7] = (d[7] - s[7]) << 1;
    s[7] = s[7] + ((d[6] + d[7]) >> 3);

    x[0] = s[0];
    x[8] = d[0];
    x[7] = s[7];
    x[15] = d[7];
}

for (i = 0; i < 16; i++) {
    x = &int_data[i];
    mid = 7;

    s[0] = x[0];
    d[0] = x[64];
    s[1] = x[128];
    s[7] = x[896];
    d[7] = x[960];

    d[0] = (d[0] << 1) - s[0] - s[1];
    s[0] = s[0] + (d[0] >> 2);

    d_tmp0 = d[0];
    s_tmp0 = s[1];

    for (ii = 1; ii < 7; ii++) {
        d[ii] = (x[128*ii+64] << 1) - s_tmp0 - x[128*(ii+1)];
        s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
        d_tmp0 = d[ii];
        s_tmp0 = s[ii];
        x[ii*64] = s[ii];
        x[(ii+8)*64] = d[ii];
    }

    d[7] = (d[7] << 1) - (s[7] << 1);
    s[7] = s[7] + ((d[6] + d[7]) >> 3);

    x[0] = s[0];
    x[512] = d[0];
    x[448] = s[7];
    x[960] = d[7];
}



In the parameter table, the vector length is the only value that changes. We give it for the first two
loops. To deduce the table for the other loops, the vector length has to be set to 14 and 6 respectively.

Parameter                 Value
Vector length             30
Reused data set size
I/O IRAMs                 8
ALU                       6
BREG                      0
FREG                      2
Data flow graph width     2
Data flow graph height    6
Configuration cycles      6+30=36



5.6.3 Optimizing the Inner Loops 

The efforts are then concentrated on the six inner loops. In fact, if we look at them, they all need 2
input data and output 4 data. 2 more data are needed for the first iteration. Hence we need at most 8
IRAMs for the first iteration and 6 for the others. This means that we can unroll the loops twice,
needing 14 IRAMs for one iteration of the new loop bodies. Below we present only the unrolled inner
loops for convenience.
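The IRAM budget behind this unrolling decision can be spelled out as a small calculation. This is a sketch using the counts from the text (2 inputs and 4 outputs per original iteration, 2 extra start values) and the 16 IRAMs of the XPP64 mentioned earlier; the function names are hypothetical.

```c
#include <assert.h>

/* IRAMs needed per iteration of a loop body unrolled `unroll` times, using
 * the counts from the text: 2 inputs and 4 outputs per original iteration,
 * plus 2 extra IRAMs for the start values of the first iteration. */
static int irams_needed(int unroll)
{
    assert(unroll >= 1);
    return unroll * (2 + 4) + 2;
}

/* Largest unroll factor whose IRAM demand fits into `available` IRAMs,
 * or 0 if even the non-unrolled body does not fit. */
static int max_unroll_factor(int available)
{
    int unroll = 0;
    while (irams_needed(unroll + 1) <= available)
        unroll++;
    return unroll;
}
```

An unroll factor of 2 needs 14 IRAMs and fits into 16; a factor of 3 would need 20 and does not.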

First loop:

for (ii = 1; ii < 31; ii = ii+2) {
    d[ii] = ((x[2*ii+1]) << 1) - s_tmp0 - x[2*ii+2];
    s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
    d_tmp0 = d[ii];
    s_tmp0 = s[ii];
    x[ii] = s[ii];
    x[ii+32] = d[ii];

    d[ii+1] = ((x[2*(ii+1)+1]) << 1) - s_tmp0 - x[2*(ii+1)+2];
    s[ii+1] = s_tmp0 + ((d_tmp0 + d[ii+1]) >> 3);
    d_tmp0 = d[ii+1];
    s_tmp0 = s[ii+1];
    x[ii+1] = s[ii+1];
    x[ii+33] = d[ii+1];
}

Second loop:

for (ii = 1; ii < 31; ii = ii+2) {
    d[ii] = (x[128*ii+64] << 1) - s_tmp0 - x[128*(ii+1)];
    s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
    d_tmp0 = d[ii];
    s_tmp0 = s[ii];
    x[ii*64] = s[ii];
    x[(ii+32)*64] = d[ii];

    d[ii+1] = (x[128*(ii+1)+64] << 1) - s_tmp0 - x[128*(ii+2)];
    s[ii+1] = s_tmp0 + ((d_tmp0 + d[ii+1]) >> 3);
    d_tmp0 = d[ii+1];
    s_tmp0 = s[ii+1];
    x[(ii+1)*64] = s[ii+1];
    x[(ii+33)*64] = d[ii+1];
}

Third loop:

for (ii = 1; ii < 15; ii = ii+2) {
    d[ii] = ((x[2*ii+1]) << 1) - s_tmp0 - x[2*ii+2];
    s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
    d_tmp0 = d[ii];
    s_tmp0 = s[ii];
    x[ii] = s[ii];
    x[ii+16] = d[ii];

    d[ii+1] = ((x[2*(ii+1)+1]) << 1) - s_tmp0 - x[2*(ii+1)+2];
    s[ii+1] = s_tmp0 + ((d_tmp0 + d[ii+1]) >> 3);
    d_tmp0 = d[ii+1];
    s_tmp0 = s[ii+1];
    x[ii+1] = s[ii+1];
    x[ii+17] = d[ii+1];
}

Fourth loop:

for (ii = 1; ii < 15; ii = ii+2) {
    d[ii] = (x[128*ii+64] << 1) - s_tmp0 - x[128*(ii+1)];
    s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
    d_tmp0 = d[ii];
    s_tmp0 = s[ii];
    x[ii*64] = s[ii];
    x[(ii+16)*64] = d[ii];

    d[ii+1] = (x[128*(ii+1)+64] << 1) - s_tmp0 - x[128*(ii+2)];
    s[ii+1] = s_tmp0 + ((d_tmp0 + d[ii+1]) >> 3);
    d_tmp0 = d[ii+1];
    s_tmp0 = s[ii+1];
    x[(ii+1)*64] = s[ii+1];
    x[(ii+17)*64] = d[ii+1];
}

Fifth loop:

for (ii = 1; ii < 7; ii = ii+2) {
    d[ii] = ((x[2*ii+1]) << 1) - s_tmp0 - x[2*ii+2];
    s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
    d_tmp0 = d[ii];
    s_tmp0 = s[ii];
    x[ii] = s[ii];
    x[ii+8] = d[ii];

    d[ii+1] = ((x[2*(ii+1)+1]) << 1) - s_tmp0 - x[2*(ii+1)+2];
    s[ii+1] = s_tmp0 + ((d_tmp0 + d[ii+1]) >> 3);
    d_tmp0 = d[ii+1];
    s_tmp0 = s[ii+1];
    x[ii+1] = s[ii+1];
    x[ii+9] = d[ii+1];
}

Sixth loop:

for (ii = 1; ii < 7; ii = ii+2) {
    d[ii] = (x[128*ii+64] << 1) - s_tmp0 - x[128*(ii+1)];
    s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
    d_tmp0 = d[ii];
    s_tmp0 = s[ii];
    x[ii*64] = s[ii];
    x[(ii+8)*64] = d[ii];

    d[ii+1] = (x[128*(ii+1)+64] << 1) - s_tmp0 - x[128*(ii+2)];
    s[ii+1] = s_tmp0 + ((d_tmp0 + d[ii+1]) >> 3);
    d_tmp0 = d[ii+1];
    s_tmp0 = s[ii+1];
    x[(ii+1)*64] = s[ii+1];
    x[(ii+9)*64] = d[ii+1];
}



We obtain the following dataflow graph of these loop bodies after a step of tree balancing has been
performed. We represent here only the graph corresponding to the first loop. To obtain the graphs for
the other loops, only the input and output data need to be changed.

[Figure: dataflow graph of the unrolled loop body]

Each input and output data will occupy an IRAM. d0 and s0 will be the only values in their IRAM,
enabling the merge operations to select between d0, resp. s0, at the first iteration and the feedback
values for the other iterations. Once the pipeline is filled, 8 values can be output in a cycle,
corresponding to 4 values for array x. The same configuration is used for all loops; only the data in the
IRAMs differ. We now give result tables only for the first 2 loops. The other tables are the same.

For the first two loops we obtain the following table, and the expected speedup with respect to a
standard superscalar processor with 2 instructions issued per cycle is 15.3.

Parameter                 Value
Vector length             30
Reused data set size
I/O IRAMs                 14
ALU                       12
BREG                      0
FREG                      2
Data flow graph width     2
Data flow graph height    10
Configuration cycles      10+15=25

Ops                                  Number
LD/ST (2 cycles)                     14
ADDRCOMP (1 cycle)                   2
ADD/SUB (1 cycle)                    17
MUL (2 cycles)                       0
SHIFT (1 cycle)                      4
Cycles per iteration                 51
Cycles needed for the loop (2-way)   (51*15)/2=383
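The figures in these two tables can be recomputed from the operation counts. The following is a minimal sketch of that arithmetic; the type and function names are hypothetical, and the model (sum of count times latency, divided by the issue width) is simply the one implied by the table.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int count;    /* operations of this class per iteration */
    int latency;  /* cycles per operation */
} op_class;

/* Sum of count * latency over all operation classes. */
static int cycles_per_iteration(const op_class *ops, size_t n)
{
    int total = 0;
    assert(ops != NULL);
    for (size_t k = 0; k < n; k++)
        total += ops[k].count * ops[k].latency;
    return total;
}

/* Estimated speedup of the XPP configuration over a superscalar processor
 * that issues `issue_width` instructions per cycle. */
static double estimated_speedup(int cycles_per_iter, int iterations,
                                int issue_width, int xpp_cycles)
{
    assert(issue_width > 0 && xpp_cycles > 0);
    double risc_cycles = (double)cycles_per_iter * iterations / issue_width;
    return risc_cycles / xpp_cycles;
}
```

With the table entries (14 LD/ST at 2 cycles, 2 ADDRCOMP, 17 ADD/SUB, 0 MUL, 4 SHIFT) one iteration costs 51 cycles; 15 iterations on a 2-way machine give about 383 cycles, against 25 configuration cycles on the XPP.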
The limitations of conventional processors are becoming more and more evident. The growing
importance of stream-based applications makes coarse-grain dynamically reconfigurable
architectures an attractive alternative [3], [4], [6], [7]. They combine the performance of
ASICs, which are very risky and expensive (development and mask costs), with the flexibility
of traditional processors [5].

In spite of the possibilities we have today in VLSI development, the basic concepts of
microprocessor architectures are the same as 20 years ago. The main processing unit of modern
conventional microprocessors, the datapath, in its actual structure follows the same style
guidelines as its predecessors. Although the development of pipelined architectures or
superscalar concepts in combination with data and instruction caches increases the performance
of a modern microprocessor and allows higher frequency rates, the main concept of a
static datapath remains. Therefore, each operation is a composition of basic instructions that
the used processor owns. The benefit of the processor concept lies in its ability to execute
strongly control-dominated applications. Data- or stream-oriented applications are not well
suited for this environment. Sequential instruction execution is not the right target for
that kind of application and needs high bandwidth because of the permanent retransmission of
instructions and data from and to memory. This handicap is often eased by the use of caches in
various stages. A sequential interconnection of filters, which perform the corresponding data
manipulation without writing back the intermediate results, would provide the right optimisation
and reduction of bandwidth. Practically, this kind of chain of filters should be constructed in a
logical way and configured during runtime. Existing approaches to extend instruction sets use
static modules, not modifiable during runtime. Customized microprocessors or ASICs are
optimized for one special application environment. It is nearly impossible to use the same
microprocessor core for another application without losing the performance gain of this
architecture.



A new approach of a flexible and high-performance datapath concept is needed, which
allows the functionality to be reconfigured and makes this core mainly application independent
without losing the performance needed for stream-based applications.

This contribution introduces a new concept of a loosely coupled implementation of the dynamically
reconfigurable XPP architecture from PACT Corp. into a static datapath of the SPARC-compatible
LEON processor. Thus, this approach is different from those where the XPP operates as
a completely separate (master) component within one Configurable System-on-Chip
(CSoC), together with a processor core, global/local memory topologies and efficient
multi-layer AMBA bus interfaces [11]. Here, from the programmer's point of view the extended
and adapted datapath looks like a dynamically configurable instruction set. It can be
customized for a specific application and accelerate the execution enormously. Therefore, the
programmer has to create a number of configurations, which can be uploaded to the XPP array
at run time; e.g. such a configuration can be used like a filter to calculate stream-oriented
data. It is also possible to configure more than one function at the same time and use them
simultaneously. This concept promises an enormous performance boost as well as the
flexibility and power reduction needed to perform a series of applications very effectively.

6.2 1. LEON RISC Microprocessor

For implementation of this concept we chose the 32-bit SPARC V8 compatible microprocessor
[1], [2], LEON. This microprocessor is a synthesisable, freely available VHDL model
which has a load/store architecture and a five-stage pipeline implementation with separate
instruction and data caches.
As shown in Figure 66, the LEON is provided with a full implementation of the AMBA 2.0 AHB
and APB on-chip buses, a hardware multiplier and divider, a programmable 8/16/32-bit memory
controller for external PROM, static RAM and SDRAM, and several on-chip peripherals
such as timers, UARTs, an interrupt controller and a 16-bit I/O port. A simple power-down mode
is implemented as well.



Figure 66: LEON Architecture Overview 



LEON is developed by the European Space Agency (ESA) for future space missions. The
performance of LEON is close to that of an ARM9 series processor, but it does not have a memory
management unit (MMU) implementation, which limits its use to single memory space applications.
Figure 67 shows the datapath of the LEON integer unit.


Figure 67: LEON Pipelined Datapath Structure
The XPP architecture [6], [7], [8] is based on a hierarchical array of coarse-grain, adaptive
computing elements called Processing Array Elements (PAEs) and a packet-oriented
communication network. The strength of the XPP technology originates from the combination of
array processing with unique, powerful run-time reconfiguration mechanisms. Since
configuration control is distributed over a Configuration Manager (CM) embedded in the array,
PAEs can be configured rapidly in parallel while neighboring PAEs are processing data.
Entire applications can be configured and run independently on different parts of the array.
Reconfiguration is triggered externally or even by special event signals originating within the
array, enabling self-reconfiguring designs. By utilizing protocols implemented in hardware,
data and event packets are used to process, generate, decompose and merge streams of
data.

The XPP has some similarities with other coarse-grain reconfigurable architectures like
the KressArray [3] or Raw Machines [4], which are specifically designed for stream-based
applications. XPP's main distinguishing features are its automatic packet-handling mechanisms
and its sophisticated hierarchical configuration protocols for runtime- and self-reconfiguration.

6.2.1 2.1 Array Structure 

A CM consists of a state machine and internal RAM for configuration caching. The PAC itself
(see top right-hand side of Figure 69) contains a configuration bus which connects the CM with
PAEs and other configurable objects. Horizontal busses carry data and events. They can be
segmented by configurable switch-objects, and connected to PAEs and special I/O objects at
the periphery of the device.
A PAE is a collection of PAE objects. The typical PAE shown in Figure 69 (bottom)
contains a BREG object (back registers) and an FREG object (forward registers) which are used
for vertical routing, as well as an ALU object which performs the actual computations. The
ALU performs common fixed-point arithmetical and logical operations as well as several
special three-input opcodes like multiply-add, sort, and counters. Events generated by ALU
objects depend on ALU results or exceptions, very similar to the state flags of a classical
microprocessor. A counter, e.g., generates a special event only after it has terminated. The next
section explains how these events are used. Another PAE object implemented in the XPP is
a memory object which can be used in FIFO mode or as RAM for lookup tables, intermediate
results etc. However, any PAE object functionality can be included in the XPP architecture.



6.2.3 2.2 Packet Handling and Synchronization

PAE objects as defined above communicate via a packet-oriented network. Two types of
packets are sent through the array: data packets and event packets. Data packets have a uniform
bit width, specific to the device type. In normal operation mode, PAE objects are
self-synchronizing. An operation is performed as soon as all necessary data input packets are
available. The results are forwarded as soon as they are available, provided the previous results
have been consumed. Thus it is possible to map a signal-flow graph directly to ALU objects. Event
packets are one bit wide. They transmit state information which controls ALU execution and
packet generation.


Figure 69: Structure of an XPP device


6.2.4 2.3 Configuration



Every PAE stores locally its current configuration state, i.e. whether it is part of a configuration
or not (states "configured" or "free"). Once a PAE is configured, it changes its state to
"configured". This prevents the CM from reconfiguring a PAE which is still used by another
application. The CM caches the configuration data in its internal RAM until the required PAEs
become available.
While loading a configuration, all PAEs start to compute their part of the application as soon as
they are in state "configured". Partially configured applications are able to process data
without loss of packets. This concurrency of configuration and computation hides configuration
latency.
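The interplay of the per-PAE state and the caching CM can be illustrated with a small software model. This is a strongly simplified sketch, not the hardware protocol: the types, the function name cm_load_step and the idea of calling it repeatedly are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { PAE_FREE, PAE_CONFIGURED } pae_state;

/* One loading step of a (hypothetical) configuration manager model.
 * wanted[k] is the index of the k-th PAE required by the configuration and
 * loaded[k] records whether that part has already been written.
 * Free PAEs are configured immediately and may start computing; PAEs still
 * used by another application are skipped, so their configuration data
 * stays cached. Returns the number of parts that are still pending. */
static size_t cm_load_step(pae_state *paes, const size_t *wanted,
                           bool *loaded, size_t n)
{
    size_t pending = 0;
    assert(paes != NULL && wanted != NULL && loaded != NULL);
    for (size_t k = 0; k < n; k++) {
        if (loaded[k])
            continue;
        if (paes[wanted[k]] == PAE_FREE) {
            paes[wanted[k]] = PAE_CONFIGURED;   /* PAE now belongs to this configuration */
            loaded[k] = true;
        } else {
            pending++;                          /* still in use: keep the data cached */
        }
    }
    return pending;
}
```

Calling the step again after the other application has released its PAEs completes the configuration, while the parts configured earlier could already have been processing data.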

6.2.5 2.4 XPP Application Mapping

The Native Mapping Language (NML), a PACT proprietary structural language with
reconfiguration primitives, was developed by PACT to map applications to the XPP array. It
gives the programmer direct access to all hardware features.

In NML, configurations consist of modules which are specified as in a structural hardware
description language, similar to, for instance, structural VHDL. PAE objects are explicitly
allocated, optionally placed, and their connections specified. Hierarchical modules allow
component reuse, especially for repetitive layouts. Additionally, NML includes statements to
support configuration handling. A complete NML application program consists of one or
more modules, a sequence of initially configured modules, differential changes, and
statements which map event signals to configuration and prefetch requests. Thus configuration
handling is an explicit part of the application program.

A complete XPP Development Suite (XDS) is available from PACT. For more details on
XPP-based architectures and development tools see [6].



The system is designed to offer a maximum of performance. LEON and XPP should be able to
communicate with each other in a simple and high-performance manner. While the XPP is a
dataflow-oriented device, the LEON is a general-purpose processor, suitable for handling
control flow [1], [2]. Therefore, LEON is used for system control. To do this, the XPP is
integrated into the datapath of the LEON integer unit, which is able to control the XPP.




6.3 3. LEON Instruction Datapath Extension
Figure 71: Extended Datapath Overview



Due to the unpredictable operation time of the XPP algorithm, integration of the XPP into the LEON
datapath is done in a loosely coupled way (Figure 71). Thus the XPP array can operate
independently of the LEON, which is able to control and reconfigure the XPP during runtime. Since
the configuration of the XPP is handled by LEON, the CM of the XPP is unnecessary and can be
left out of the XPP array. The configuration codes are stored in the LEON RAM. LEON
transfers the needed configuration from its system RAM into the XPP and creates the
needed algorithm on the array.
To enable a maximum of independence of the XPP from LEON, all ports of the XPP, input ports as
well as output ports, are buffered using dual-clock FIFOs. Dual-clocked FIFOs are implemented in
the IO-Ports between LEON and XPP. To transmit data to the extended XPP-based datapath, the data
are passed through an IO-Port as shown in Figure 5. In addition to the FIFO, the IO-Ports contain
logic to generate handshake signals and an interrupt request signal. The IO-Port for receiving data
from the XPP is similar to Figure 5 except for the reversed direction of the data signals. This
allows the XPP to work completely independently of LEON as long as there are input data available
in the input port FIFOs and free space for result data in the output port FIFOs.
There are a number of additional features implemented in the LEON pipeline to control the data
transfer between LEON and XPP.




When LEON tries to write to an IO-Port
containing a full FIFO or to read from an IO-Port
containing an empty FIFO, a trap is generated.
This trap can be handled by a trap handler.
A further mechanism, pipeline holding, is
implemented to allow LEON to hold the pipeline
and wait for free FIFO space during an XPP
write access, or for a valid FIFO value during
an XPP read access. When using pipeline holding,
the software developer has to avoid reading from
an IO-Port with an empty FIFO while the XPP,
or rather the XPP input IO-Ports, contain no
data from which to produce outputs. In this case
a deadlock will occur and the complete system
has to be reset.
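
The two access policies can be illustrated with a small software model. This is only a sketch under stated assumptions: the names IoPort, FifoTrap, leon_write, leon_read and would_deadlock are hypothetical, the FIFO depth is arbitrary, and the deadlock check deliberately ignores values still in flight inside the XPP array.

```python
from collections import deque


class FifoTrap(Exception):
    """Models the trap raised when a port access cannot complete (trap mode)."""


class IoPort:
    """Hypothetical model of one IO-Port: a bounded FIFO between LEON and XPP."""

    def __init__(self, depth):
        self.depth = depth
        self.fifo = deque()

    def is_full(self):
        return len(self.fifo) >= self.depth

    def is_empty(self):
        return not self.fifo


def leon_write(port, value, hold=False):
    """LEON writes to an XPP input port.

    Trap mode (hold=False): a full FIFO raises FifoTrap.
    Pipeline-hold mode (hold=True): a full FIFO returns "hold", meaning the
    pipeline would stall until space becomes free.
    """
    if port.is_full():
        if hold:
            return "hold"
        raise FifoTrap("write to full FIFO")
    port.fifo.append(value)
    return "ok"


def leon_read(port, hold=False):
    """LEON reads from an XPP output port; returns (status, value)."""
    if port.is_empty():
        if hold:
            return ("hold", None)
        raise FifoTrap("read from empty FIFO")
    return ("ok", port.fifo.popleft())


def would_deadlock(xpp_input_port, xpp_output_port):
    """A held read can never complete if the output FIFO is empty and the
    XPP has no input data left from which to produce a result."""
    return xpp_output_port.is_empty() and xpp_input_port.is_empty()
```

In this model a held read on an empty output port is only safe if would_deadlock returns False beforehand, which mirrors the responsibility placed on the software developer above.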

The XPP can generate interrupts for the LEON
when it tries to read a value from an empty
FIFO port or to write a value to a full FIFO
port. The occurrence of such interrupts indicates
that the XPP array cannot process the next step
because it either has no input values or cannot
output the result value. The interrupts
generated by the XPP are maskable.
The interface provides information about the
FIFOs. LEON can read the number of valid
values each FIFO contains.



Figure 72: LEON-to-XPP dual-clock FIFO

The interface to the XPP appears to the LEON
as a set of special registers (Figure 6). These
XPP registers can be categorized into
communication registers and status registers.



Figure 71: Extended LEON Instruction Pipeline



For data exchange the XPP communication
registers are used. Since the XPP provides three
different types of communication ports, there
are also three types of communication registers,
each of which is split into an input part
and an output part:

The data for the process are accessed through
XPP data registers. The number of data input
and data output ports as well as the data bit
width depends on the implemented XPP array.
The XPP can generate and consume events. Events
are one-bit signals. The number of input events
and output events again depends on the
implemented XPP array.

Configuration of the XPP is done through the
XPP configuration register. LEON reads the
required configuration value from a file
stored in its system RAM and writes it to the
XPP configuration register.

There are a number of XPP status registers
implemented to control the behavior and get status
information of the interface. Switching between
the usage of trap handling and pipeline holding
can be done in the hold register. An XPP clock
register contains the clock frequency ratio
between LEON and XPP. By writing this register,
LEON software can set the XPP clock relative
to the LEON clock. This makes it possible to adapt
the XPP clock frequency to the required XPP
performance and consequently to influence the power
consumption of the system. Writing zero to the
XPP clock register turns off the XPP. Finally,
there is a status register for every FIFO
containing the number of valid values currently
available in the FIFO.
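
As an illustration of the clock register, the following sketch derives the XPP clock from the LEON clock. The exact encoding of the ratio is not specified here, so the sketch assumes, purely for illustration, that the register holds an integer divisor of the LEON clock; the function name xpp_clock_mhz is hypothetical.

```python
def xpp_clock_mhz(leon_clock_mhz, clock_register):
    """XPP clock implied by the clock register.

    Assumption (not from the text): the register value is an integer
    divisor of the LEON clock. Only the meaning of zero is given above.
    """
    if clock_register == 0:
        return 0.0  # writing zero turns the XPP off
    return leon_clock_mhz / clock_register
```

Under this assumed encoding, a register value of 5 at a 250 MHz LEON clock corresponds to a 50 MHz XPP clock.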

These status registers provide a maximum of
flexibility in communication between LEON
and XPP and enable different communication
modes:

If there is only one application running
on the system at a time, software may be
developed in pipeline-hold mode. Here
LEON initiates data reads from or writes
to the XPP. If there is no value to
read or no free space to write, the LEON
pipeline will be stopped until the read or write
is possible. This can be used to reduce
power consumption of the LEON part.

In interrupt mode, the XPP can influence
the LEON program flow. Here the IO-Ports
generate an interrupt depending on
the actual number of values available in the
FIFO. The communication between LEON
and XPP is done in interrupt service
routines.



Polling mode is the classical way to
access the XPP. Initiated by a timer event,
LEON reads all XPP ports containing data
and writes all XPP ports containing free
FIFO space. Between these phases LEON
can perform other calculations.
It is possible at any time to switch between these
strategies within one application.
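
A single polling pass can be sketched as follows. This is a minimal model, not the actual LEON software: the function poll_ports and its parameters are hypothetical, the FIFOs are modelled as Python deques, and the FIFO fill levels stand in for the status registers described above.

```python
from collections import deque


def poll_ports(in_fifos, out_fifos, pending, depth):
    """One timer-triggered polling pass (illustrative model).

    out_fifos: XPP output port FIFOs (XPP -> LEON), drained completely.
    in_fifos:  XPP input port FIFOs (LEON -> XPP), refilled from `pending`
               as long as free FIFO space remains.
    depth:     assumed capacity of each input FIFO.
    Returns the list of result values read from the XPP.
    """
    results = []
    for fifo in out_fifos:
        while fifo:  # read all ports containing data
            results.append(fifo.popleft())
    for fifo in in_fifos:
        while pending and len(fifo) < depth:  # write while space is free
            fifo.append(pending.popleft())
    return results
```

Between two calls of such a routine LEON is free to run other code, which is the main difference from pipeline-hold mode.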
The XPP is delivered containing a configuration
manager to handle configuration and
reconfiguration of the array. In this concept the
configuration manager is dispensable because
configuration as well as any reconfiguration
is controlled by the LEON through the XPP
configuration register. All XPP configurations
used for an application are stored in the
LEON's system RAM.

6.4 4. Tool and Compiler Integration

The LEON's SPARC V8 instruction set [1] was
extended by a new subset of instructions to
make the new XPP registers accessible through
software. These instructions are based on the
SPARC instruction format, but they do not
conform to the SPARC V8 standard. Following
the SPARC conventions of a load/store
architecture, the instruction subset
can be split into two general types. Load/store
instructions can exchange data between the
LEON memory and the XPP communication
registers. The number of cycles per instruction
is similar to that of the standard load/store
instructions of the LEON. Read/write instructions are
used for communication between LEON
registers. Since the LEON register set is extended
by the XPP registers, the read/write instructions
are also extended to access XPP registers.
Status registers can only be accessed with
read/write instructions. Execution of arithmetic
instructions on XPP registers is not possible.
Values have to be written to standard LEON
registers before they can be the target of arithmetic
operations.

Table 2: Performance on IDCT (8x8)

The complete system can still execute any
SPARC V8 compatible code. In that case, the
XPP is completely unused.

The LEON is provided with the LECCS cross
compiler system [9], licensed under the terms of
the LGPL. This system consists of modified
versions of binutils 2.11 and gcc 2.95.2. To
make the new instruction subset available to
software developers, the assembler of the
binutils has been extended by a number of
instructions according to the implemented instruction
subset. The new instructions have the same
mnemonics as the regular SPARC V8 load,
store, read and write instructions. Only the new
XPP registers have to be used as source or
target operand. Since the modifications
of LECCS are straightforward extensions, the
cross compiler system is backward compatible
with the original version. The availability of the
source code of LECCS made it possible to extend
the tools by the new XPP operations in the
described way.



The development of the XPP algorithms has
to be done with separate tools, provided by
PACT Corp.

6.5 5. Application Results

As a first analysis application, an inverse DCT
applied to an 8x8 pixel block was implemented.
For all simulations we used a 250 MHz clock
frequency for the LEON processor and a 50 MHz
clock frequency for the XPP. The usage of the XPP
accelerates the computation of the IDCT by about
a factor of four, depending on the communication
mode. However, the XPP has to be configured
before computing the IDCT on it. Table 1 also
shows the configuration time for this algorithm.
As shown in Figure 74, the benefit brought by the
XPP rises with the number of IDCT blocks computed
by it before reconfiguration, so the number of
reconfigurations during complex algorithms
should be minimised.
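
The dependence on the block count can be made explicit with a simple amortization model. The model and its symbols are illustrative assumptions rather than measured values: t_cfg is the one-time configuration time, t_XPP the time per IDCT block on the configured XPP, and t_LEON the time per block on the LEON alone.

```latex
S(n) = \frac{n \, t_{\mathrm{LEON}}}{t_{\mathrm{cfg}} + n \, t_{\mathrm{XPP}}},
\qquad
\lim_{n \to \infty} S(n) = \frac{t_{\mathrm{LEON}}}{t_{\mathrm{XPP}}},
\qquad
S(n) > 1 \iff n > \frac{t_{\mathrm{cfg}}}{t_{\mathrm{LEON}} - t_{\mathrm{XPP}}}
```

In this model the speedup S(n) for n blocks per configuration grows monotonically towards the per-block ratio, which the text puts at about four, and the XPP only pays off once n exceeds the break-even count on the right.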



A first complex application implemented on the
system is MPEG-4 decoding. The optimization
of the algorithm partitioning between LEON and
XPP is still under construction. Figure 8 shows the
block diagram of the MPEG-4 decoding
algorithm. Frames with 320 x 240 pixels
were decoded. LEON, using only SPARC V8
standard instructions, decodes one frame in 23,46
seconds. In a first implementation of MPEG-4
using the XPP, only the IDCT is computed by the
XPP; the rest of the MPEG-4 decoding is still
done by LEON. With the help of the XPP,
one frame is decoded in 17,98 s. This is a
performance boost of more than twenty percent.
Since the performance gain from accelerating
only the IDCT algorithm is still very low at the
moment, we are working on XPP implementations of
Huffman decoding, dequantisation and
prediction decoding. This will increase the
performance boost of this concept over the
standalone LEON.



6.6 6. Conclusion 







Today, the instruction datapaths of modern
microprocessors reach their limits by using
static instruction sets, driven by the traditional
von Neumann or Harvard architectural
principles. A way out of these limitations is a
dynamically reconfigurable processor datapath
extension achieved by integrating traditional static
datapaths with the coarse-grain dynamically
reconfigurable XPP architecture (eXtreme
Processing Platform). To this end, a loose
asynchronous coupling mechanism for the given
instruction datapath has been developed and integrated
in a CMOS 0.13 µm standard cell technology
from UMC. Here, the SPARC compatible
LEON RISC processor is used, whose
static pipelined instruction datapath has been
extended so that it can be configured and personalized for
specific applications. This compiler-compatible
instruction set extension allows varied and
efficient use, e.g. in streaming application
domains like MPEG-4, digital filters, mobile
communication modulation, etc. The
introduced coupling technique using flexible
dual-clock FIFO interfaces allows asynchronous
concurrency and adaptation of the frequency of the
configured XPP datapath to the actual
performance requirements, e.g. for avoiding
unneeded cycles and reducing power
consumption.



As presented above, the introduced concept
combines the flexibility of a general-purpose
microprocessor with the performance and
power consumption of coarse-grain
reconfigurable datapath structures, nearly comparable to
ASIC performance. Here, two programming
and computing paradigms (control-driven von
Neumann and transport-triggered XPP) are
unified within one hybrid architecture with the
option of two clock domains. The ability to
reconfigure the transport-triggered XPP makes
the system independent of standards or
specific applications. This concept opens the potential
to develop multi-standard communication
devices like software radios by using one
extended processor architecture with adapted
programming and compilation tools. Thus, new
standards can be easily implemented through
software updates. The system is scalable at
design time through the scalable array structure
of the XPP extension used. This extends the
range of suitable applications from products
with few multimedia functions to complex
high-performance systems.

Second Major part

Another aspect of the present invention will be described in the following. This aspect deals with problems
relating to the implementation of hyper-threading, multi-threading, multi-tasking, scheduling and execution
of parts of configurations and so forth.

It is noted that WO 00/49496 already discloses a method for execution of a computer program using a
processor comprising a configurable functional unit capable of executing reconfigurable instructions, the
effect of which can be redefined at runtime.

A problem in conventional processor architectures exists if a coupling of, for example, sequential
processors is needed and/or technologies such as data-streaming, hyper-threading, multi-threading and so
forth are to be used in a useful way that enhances performance. Techniques known in the prior art, such as
WO 02/50665 A1, do not allow for a sufficiently efficient way of effecting data exchange between the ALU of a
CPU and the configurable data processing logic cell field, be it an FPGA, DSP or the like, that data
exchange being effected via registers in the prior art. In other words, it is necessary to first sequentially write
data into a register and then retrieve them sequentially and restore them sequentially as well.

Another problem exists if an external access to data is requested in known devices such as those cited, which
are used inter alia to implement functions in the configurable data processing logic cell field, DFP, FPGA or
the like, that cannot be processed sufficiently on the CPU-integrated ALU. Accordingly, the data processing
logic cell field is practically used to allow for user-defined opcodes that can process data more efficiently than
is possible on the ALU of the CPU without further support by the data processing logic cell field. In the prior
art, the coupling is generally word-based, not block-based. It is a further important aspect of the present
invention that it has been realized that for data-streaming data processing, block-based coupling is highly
preferable. At any rate, more efficient data processing, in particular more efficient than is possible with a
close coupling via registers, is highly preferable.

Another method for the use of logic cell fields consisting of coarse- and/or fine-granular logic cells and
logic cell elements consists in a very loose coupling of such a field to a conventional CPU and/or a CPU
core in embedded systems. Here, a conventional sequential program can be executed on the CPU, for
example a program written in C, C++ or the like, wherein the instantiation or the data stream processing by the
fine- and/or coarse-granular data processing logic cell field is effected via that sequential program. Then,
the problem exists that in programming said logic cell field a program not written in C or any other
sequential high-level language must be provided for the data stream processing. It would be preferable here to
allow for C programs to run both on a conventional CPU architecture and on the data processing
logic cell field operated therewith; in particular, despite the fact that a quasi-sequential program execution
should maintain the capability of data-streaming in the data processing logic cell fields, whereas
simultaneously the capability exists to operate the CPU in a not too loosely coupled way.

It is already known to provide for sequential data processing within a data processing logic cell field, see
for example DE 196 51 075, WO 98/26356, DE 196 54 846, WO 98/29952, DE 197 04 728, WO 98/35299,
DE 199 26 538, WO 00/77652, DE 102 12 621. Here, partial execution is achieved within a single
configuration, for example to reduce the amount of resources needed, optimize the time of execution and so forth;
however, this does not lead automatically to allowing a programmer to translate or transfer high-level
language code automatically onto a data processing logic cell field as is the case in common machine models
for sequential processes. The compilation, transfer or translation of high-level language code onto data
processing logic cell fields according to the methods known for models of sequentially executing machines
does remain difficult in fact.

In the prior art, it is further known that configurations that effect different functions on respective parts of
the area can be simultaneously executed on the processing array and that a change of one or some of
the configuration(s) without disturbing other configurations is possible at run-time.
Methods and hardware-implemented means for the implementation are known to ensure that the execution
of partial configurations to be loaded onto the array is possible without deadlock. Reference is being made
to DE 196 54 593, WO 98/31102, DE 198 07 872, WO 99/44147, DE 199 26 538, WO 00/77652, DE 100
28 397, WO 02/13000. This technology allows in a certain way a certain parallelism and, given certain
forms and interrelations of the configurations or partial configurations, a certain way of
multi-tasking/multi-threading, in particular in such a way that the planning, that is the scheduling and/or the
planning control for time use, can be provided for. Furthermore, from the prior art, time use planning control
means and methods are known per se that, at least under a corresponding interrelation of configurations
and/or assignment of configurations to certain tasks and/or threads to configurations and/or sequences of
configurations, allow for multi-tasking and/or multi-threading. The use of such time use planning control
means used in the prior art for configuring and management of configurations for the purpose of scheduling
of tasks, threads, multi- and hyper-threads is considered to be inventive per se.

In at least a partial aspect of preferred variants, it is preferable to provide for support of modern
technologies of data processing and program execution such as multi-tasking, multi-threading, hyper-threading,
and so forth.

Another important aspect of the present invention can be seen in that data are inputted into the data
processing logic cell field in response to the execution of a load configuration by that data processing logic cell
field and/or in that data are stored from that data processing logic cell field by executing a store
configuration. Accordingly, it is preferred to provide the load and/or store configurations in such a way
that the addresses of the memory cells used are directly or indirectly generated within the data processing
logic cell field, the addresses indicating those memory cells and/or locations to which an access has to be
effected as load and/or store access, that is as a read and/or write access. By configuring address
generators within the configuration it becomes possible to load a plurality of data into the data processing logic
cell field, where they can be stored in internal RAMs (IRAMs) and/or within internal cells such as
EALUs having registers and/or in other dedicated memory and/or storage means. The load or store
configuration respectively thus allows for block-wise and thus almost data-stream-like loading and
storing of data, this being in particular much faster than a single access, and can be executed prior to or during
the execution of one or more data processing configurations (and/or configurations handling data in a
data-altering manner) that process the preloaded data.
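
The role of an address generator inside a load or store configuration can be sketched in software. The following is a minimal illustration under stated assumptions: memory and IRAM are modelled as Python lists, the linear base/stride addressing scheme is chosen for simplicity, and the names address_generator, load_configuration and store_configuration are hypothetical.

```python
def address_generator(base, count, stride=1):
    """Yield the memory addresses of one block (assumed linear pattern)."""
    for i in range(count):
        yield base + i * stride


def load_configuration(memory, iram, base, count, stride=1):
    """Block-wise load: addresses are generated inside the 'array' model,
    so no per-datum CPU instruction is involved."""
    for slot, addr in enumerate(address_generator(base, count, stride)):
        iram[slot] = memory[addr]


def store_configuration(memory, iram, base, count, stride=1):
    """Block-wise store of IRAM contents back to memory."""
    for slot, addr in enumerate(address_generator(base, count, stride)):
        memory[addr] = iram[slot]
```

A processing configuration would run between the two calls and operate only on the IRAM contents.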

The data loading can take place, provided that the logic cell field is, as is typically the case, sufficiently
large, in small partial areas thereof, while other partial areas are executing other tasks. For example, in the
ping-pong-like data processing described in other published documents by the present applicant, that
known ping-pong-like data processing relies on memory cells provided on each side of the data processing
field, where data in a first data processing step stream from the memory on one side through the data
processing field to the memory on the other side of that data processing field, where they are stored as
intermediate results while, if necessary, the array is reconfigured; the intermediate results then stream
back for further processing and so forth. Here, a memory strip on one side and/or memory part on one
side can be preloaded with data by a load configuration in one array part while on the memory part on the
other side of the logic cell field data are written out using a store configuration. It is to be noted that such a
simultaneous load/store way of data processing is possible even without spatial distribution and/or
separation of memory areas in which data are retrieved and/or in which data are stored.
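
The ping-pong scheme can be pictured with a short model in which two buffers stand for the memories on either side of the array and each function in a list stands for one configuration. The function ping_pong and its arguments are hypothetical and serve only to illustrate the data movement.

```python
def ping_pong(data, stages):
    """Stream data back and forth between two memories.

    data:   initial contents of the memory on one side.
    stages: list of per-element functions; each models one configuration
            that the array is reconfigured to between passes.
    Returns the contents of the memory holding the final results.
    """
    left, right = list(data), []
    src, dst = left, right
    for stage in stages:
        dst.clear()
        dst.extend(stage(x) for x in src)  # stream through the array
        src, dst = dst, src  # reconfigure and reverse the direction
    return src
```

For example, ping_pong([1, 2, 3], [lambda x: x + 1, lambda x: x * 2]) passes the data left to right through the first configuration and right to left through the second.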

It is possible to effect the data loading from a cache and/or into a cache. It is advantageous if the external
communication to large memory banks is handled via a cache controlling unit without having to provide for
separate circuitry within the data processing logic cell field, in that the access in a writing or reading manner
to cache memory means typically is very fast and has a small latency (if any), and in that also typically a
CPU unit is, here typically via a load/store unit, coupled to the cache so that access to data and an
exchange thereof between the CPU core and the data processing logic cell field can be effected fast,
block-wise and such that not every single datum needs to be transferred via a separate instruction that must be
fetched, for example, by the opcode fetcher of the CPU and processed therein.

This cache coupling is also highly preferred compared to the coupling of the data processing logic cell field
to the ALU of the CPU via registers, if those registers communicate only via a load/store unit with the
cache, as is known in the non-PACT prior art.

It is possible to provide for a further data connection to and/or from the load/store unit of the, or one of the,
sequential CPU units connecting to the data processing logic cell field and/or its registers.

It is to be mentioned that it is possible to address units via separate input/output ports of the data processing
logic cell field, which can in particular be provided as a VPU or XPP, and/or to address the data processing
logic cells via one or more multiplexers downstream of a single port.

It is also to be mentioned that besides the block-wise and/or streaming and/or random mode access to
cache areas in a writing and a reading manner and/or to the load/store unit and/or the per se known
connection to the registers of a sequential CPU, it is also possible to provide for a connection to an external
mass memory such as a RAM, a hard disk or any other data exchange or input or output port such as an
antenna and so forth. It is possible to provide separate ports for the access to several such units and/or
memory means. It is also to be mentioned that suitable drivers, signal conditioning circuitry and so forth
should be provided. Furthermore, it is to be mentioned that in particular, although not exclusively, for the
handling of a data stream streaming into the data processing logic cell field and/or out of that data
processing logic cell field, the logic cells of that field can comprise ALUs or EALUs respectively, which can
have at their input and/or output ports short, fine-granularly configurable FPGA-like circuitries, for
example to cut out 4-bit blocks from a continuous data stream as is necessary for example for MPEG-4
decoding. This is advantageous on the one hand if a data stream is to be inputted into the cell and is to be
processed or preprocessed without blocking larger PAE units. It is in particular of advantage if the ALU is
provided as a SIMD ALU; here, for example, a very broad data word having for example a 32-bit data width
is split via an FPGA-like stripe in front of the SIMD ALU, splitting the broad, e.g. 32-bit, data into eight
data words having for example a 4-bit data width that can then be processed in parallel in the SIMD ALU,
increasing the overall performance of the system significantly, provided that applications of this kind are
needed.
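
The splitting of a 32-bit word into eight 4-bit lanes for a SIMD ALU can be sketched as follows. The functions split_nibbles, join_nibbles and simd_add4 are hypothetical names, and the lane-wise addition is just one example of an operation such a SIMD ALU might perform.

```python
def split_nibbles(word):
    """Split a 32-bit word into eight 4-bit lanes, least significant first."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]


def join_nibbles(lanes):
    """Reassemble eight 4-bit lanes into one 32-bit word."""
    word = 0
    for i, lane in enumerate(lanes):
        word |= (lane & 0xF) << (4 * i)
    return word


def simd_add4(a, b):
    """Lane-wise 4-bit addition: each lane wraps around on its own and
    no carry propagates into the neighbouring lane."""
    lanes = [(x + y) & 0xF for x, y in zip(split_nibbles(a), split_nibbles(b))]
    return join_nibbles(lanes)
```

The absence of carries between lanes is what distinguishes this from an ordinary 32-bit addition and lets all eight lanes be processed in parallel.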

Furthermore, it is noted that when reference is being had to FPGA-like pre- or post-structures, it is not
absolutely necessary to refer to 1-bit-granular devices; in contrast, it would be possible to provide
fine-granular structures of, for example, 4 bits instead of the hyper-fine-granular 1-bit structures. In other words,
the FPGA-like input and/or output structures in front of or downstream of the ALU unit, in particular SIMD
ALU units, are configurable in such a way that always 4-bit data words are processed. It is also possible to
provide for a cascading, so that for example incoming 32-bit-wide data words are separated into four 8-bit
parts by 8-bit FPGA-like structures arranged in sequence, the four 8-bit data words then being processed in four
FPGA-like 8-bit-wide structures, followed by a second stripe of eight separate 4-bit-wide FPGA-like
structures and, if necessary, for example sixteen separate parallel 2-bit FPGA-like structures. If this is the
case, a significant reduction of the overhead compared to a hyper-fine-granular 1-bit
FPGA-like structure can be achieved. This allows the configuration memory and so forth to be reduced
significantly, thus saving silicon area.

It is to be noted that all of these coupling advantages are achievable using data block streams via a cache;
however, it is preferred in particular if that cache is built slice-wise and if an access onto several slices can
take place simultaneously, in particular onto all slices at the same time. This is advantageous if, as will be
understood, the data processing logic cell field (XPP) and/or the sequential CPU and/or CPUs have to
process a plurality of threads, be it by way of hyper-threading, multi-tasking and/or multi-threading. It is also
preferable to provide cache storage means with slice access and/or slice access enabling control means.
For example, every single thread can be assigned a separate slice, thus allowing that on processing that
thread the respective cache areas are accessed on the re-entry of the group of codes to be processed.
However, it is to be understood that the cache need not necessarily be separated into slices and that, even if the
cache is separated into slices, not every single thread must be assigned a separate slice. However, it is to be
noted that this is a highly preferred method. Furthermore, it is to be noted that there may be cases wherein
not all cache areas are used simultaneously or temporarily at a given time. Instead, it is to be expected that
in typical data processing applications such as in hand-held mobile telephones, laptops, cameras and so
forth there will exist periods during which not the entire cache is needed. Thus, it is highly preferred if
certain cache areas can be separated from the power source in such a way that the energy consumption is
significantly reduced, in particular close to or exactly to 0. This can be achieved by power supply separation
means adapted to separate cache slices from power. The separation can be effected by
down-clocking, by separation of clock lines and/or by the overall separation of a power supply. It is in particular
possible to provide for such a separation for every single cache slice, for example by an access identification
means adapted to identify whether or not a thread, hyper-thread, task or the like is currently assigned to a
respective cache slice. In case the access identification means indicates and/or detects that this is not the case,
typically a separation of the slice from a clock line and/or even the power line can be envisaged.
It is also to be noted that on powering up again after a separation from power it is possible to access the
cache area immediately; thus, no significant delay from switching the power ON or OFF is to be expected,
as long as the hardware is implemented with current semiconductor technologies.
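
The interplay of slice assignment and power separation can be illustrated by a small bookkeeping model. It is a sketch under stated assumptions only: the class SlicedCache and its methods are hypothetical, one slice per thread is assumed, and "powered" merely records whether the access identification would keep the slice connected to clock and power.

```python
class SlicedCache:
    """Illustrative model of a sliced cache with per-slice power gating."""

    def __init__(self, n_slices):
        self.owner = [None] * n_slices     # thread currently assigned, if any
        self.powered = [False] * n_slices  # unassigned slices stay off

    def assign(self, thread_id):
        """Give the thread a free slice and power that slice up."""
        for i, owner in enumerate(self.owner):
            if owner is None:
                self.owner[i] = thread_id
                self.powered[i] = True
                return i
        raise RuntimeError("no free cache slice")

    def release(self, thread_id):
        """Thread finished: its slice is no longer assigned and is
        separated from power."""
        for i, owner in enumerate(self.owner):
            if owner == thread_id:
                self.owner[i] = None
                self.powered[i] = False

    def active_slices(self):
        """Number of slices currently drawing power in this model."""
        return sum(self.powered)
```

In this model the number of powered slices always equals the number of threads holding a slice, which is the energy-saving property described above.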

A further particular advantage of the present invention resides in the fact that although the transfer of data
and/or operands is possible in a block-wise manner, no particular balancing is needed to ensure that exactly
the same times of execution of data processing steps in the sequential CPU and the XPP and/or other data
processing logic cell fields are achieved. Instead, the processing is frequently independent, in particular in
such a way that the sequential CPU and the data processing logic cell field can be considered as separate
resources by a scheduler. This allows for the immediate implementation of known data processing
program splitting technologies such as multi-tasking, multi-threading, hyper-threading. The advantage that a
data path balancing is not necessary has as a result that, for example, in a sequential CPU a number of
pipeline stages may be included, clock frequencies and/or schemes of clocking may be achieved in a different
way and so forth. It is a particular advantage if asynchronous logic is needed. A further advantage of the
present invention results in that by configuring a load and a store configuration into the data processing
logic cell field, the data can be loaded into that field or out of that field in a way which is no longer
controlled by the clock frequency of the CPU, the performance of the opcode fetcher and so forth. In other
words, the opcode fetcher no longer bottlenecks the data throughput to the data logic cell field, without
there being only a loose coupling.

In a particularly preferred embodiment of the invention it is possible to use the known CT or CM
(commonly employed in the XPP unit, also given the fact that with one or more, even hierarchically arranged,
XPP fields having in preferred embodiments their own CTs while simultaneously using one or more
sequential CPUs) here as a quasi hyper-threading hardware management unit, having the inherent advantage
that known technologies such as FILMO and so forth become applicable for the hardware support and
management of hyper-threading and so forth; it is alternatively and/or, in particular in a hierarchical
arrangement, also possible to provide the configurations from the opcode fetcher of a sequential CPU via the
coprocessing interface, allowing an XPP and/or data processing logic cell field call to be instantiated by the
sequential CPU to effect data processing on the data processing logic cell field. Cache coupling and/or load
and/or store configurations providing address generators for loading and/or storing of data into that data
processing logic cell field or out of that field will provide for the data exchange of the XPP. In other words,
the coprocessor-like coupling to the data processing logic cell field is enabled while simultaneously a
data-stream-like data loading is effected via cache and/or I/O port coupling.

It is to be mentioned that the method of coprocessor coupling, that is the indicated coupling of the data
processing logic cell field, will typically result in the scheduling of that logic cell field taking place on the
sequential CPU and/or a supervising scheduler unit and/or a respective scheduler means. In such a case, the
threading control and/or management practically takes place on the scheduler and/or the sequential CPU.
Although this is per se possible, this will not necessarily be the case where the easiest implementation of
the invention is sought. The data processing logic cell field can be called in a conventional manner, such as
has been the case in a standard coprocessor such as a combination of 8086/8087.

It is to be mentioned that in the particularly preferred embodiment, independent of the way of its
configuration, be this via a coprocessor interface, the configuration manager acting as scheduler at the same time, or in
any other way, it is possible to address memory within or in the immediate vicinity of the data processing logic
cell field or under its management, in particular memory within the XPP architecture as is known from the
applicant, RAM-PAEs or other; accordingly, managing internal memories like a vector register is
suggested, that is the data volumes loaded via the load configuration are stored vector-like in vector registers in
the internal memory cells, said registers thereafter being accessed after loading and/or activating a new
configuration for effecting the actual data processing. (It is to be noted that a data processing configuration can
be referred to as one configuration even in case several distinct configurations are to be processed
simultaneously, one after the other or in a wave-like mode.)

Here, a vector register can be used to store results and/or intermediate results in the internal or internally
managed memory cell elements. The vector register-like accessed memory means in the XPP can be used
also, after reconfiguration of the processing configuration by loading a store configuration in a suitable
manner, in a way that takes place again in a data-stream-like manner, be it via an I/O-port directly streaming
data into external memory areas and/or, as particularly preferred, into cache areas or out of these which
then can be accessed at a later stage by the sequential CPU and/or other configurations executed on the
other data processing logic cell field, in particular a data processing logic cell field having produced said
data in the first place.

In a particularly preferred embodiment, at least for certain data processing results and/or intermediate
results, the memory and/or memory register means into which the processed data are to be stored is not an
internal memory; instead, a cache area having access reservation, in particular cache areas which are
organized in a slice-wise manner, can be used. This can have the disadvantage of a larger latency, in
particular if the paths between the XPP and/or data processing logic cell fields to or from the cache are of
considerable length such that signal transmission delays need to be considered. Still, this allows to avoid
additional store configurations. It is also to be noted that such a way of storing data in a cache area
becomes on the one hand possible by placing the memory, into which data are stored, physically close to the
cache controller and embodying that memory as a cache, but that alternatively and/or additionally the
possibility exists to submit a part of a data processing logic cell field memory area, internal memory, such as
e. g. in the "RAM over PAE" case, under the control of one or several cache-memory controller(s).

This is advantageous if the latency in storing the data processing results is to be kept small, while latency
in accessing the memory area serving as a quasi-cache to other units will not be too significant in other
cases.

It is also to be mentioned that an embodiment is possible in such a way that the cache controller of the
known sequential CPU addresses as a cache a memory area that, without serving for the purpose of data
exchange with a data processing logic cell field, is physically placed onto that data processing logic cell
field and/or close to that field. This is advantageous in that, if applications are run on the data processing
logic cell fields having a very small local memory need and/or if only few other configurations compared to
the overall amount of memory space available are needed, these memory areas can be assigned to one or
more sequential CPUs as cache or additional cache. It is to be mentioned that in such a case the cache
controller may be adapted for the management of a cache area having a dynamically varying size.

A dynamical cache-size management and/or dynamical cache-size management means for the dynamical
cache management may particularly take into account the work load on the sequential CPU and/or the data
processing logic cell fields. In other words, it could be analyzed for example how many NOPs in a given
time unit are executed on the sequential CPU and/or how many configurations are preloaded in the
dynamically reconfigurable field in the memory areas provided therefor, so as to enable fast reconfiguration
(be it by way of wave-reconfiguration or in any other way). The dynamical cache size or cache size
management disclosed herewith is, in a highly preferred manner, runtime dynamical, that is, the cache
controller controls a momentary cache size that can be changed from clock-cycle to clock-cycle or from one
group of clock-cycles to the other. It is also to be noted that the access management of a data processing
logic cell field with access as internal memory such as vector register is possible. While as previously
discussed a configuration management unit can be provided, it is now to be explicitly noted that such units
and their way of operation, allowing in particular the preloading of configurations currently not yet needed,
can be used very easily to effect the multi-task operation and/or hyper-threading and/or multi-threading, in
particular for task- and/or thread- and/or hyper-thread switches. Here, it is also noted that during the runtime
of a thread or a task it is possible to preload configurations for different tasks and/or threads and/or
hyper-threads into the PAE-array. This then allows to preload configurations for a different task and/or thread
if the current thread or task cannot be executed, for example because data must be waited for, be it that they
have not yet been received, for example due to latencies, be it because a resource is blocked by another
access. In case of the configuration preloading for a different task or thread, a switch or change becomes
possible without the disadvantage of a timing overhead because of the, for example, shadow-like loaded
configuration execution.

It is in principle possible to use this technique also in cases where the most likely continuation of an
execution is predicted and a prediction is missed. However, this way of operation will be most preferred in
cases free of predictions. When using a pure sequential CPU and/or several pure sequential CPUs, the
configuration manager thus also acts as and realizes a hyper-threading management hardware. Reference is
being had to DE 198 07 872, WO 99/44147 and WO 99/44120. It can be considered as sufficient, in particular
in case where the CPU and/or several sequential CPUs shall have a hyper-threading management, to keep
partial circuitry elements such as the FILMO, described in the documents referred to. In particular, the use
of the configuration manager described therein with and/or without FILMO for use with the hyper-threading
management for one and/or more purely sequential CPUs with or without coupling to a data processing
logic cell field is disclosed herewith and claimed as inventive, although it is also to be noted that not in all
cases where new and inventive features are disclosed in the present application these features are
explicitly referred to as being inventive.

It is also to be noted that the plurality of CPUs can be realized with known techniques such as known for
example from DE 102 12 621, PCT/EP 02/10572. It is also to be noted that DE 106 51 075, DE 106 54
846, DE 107 04 728, WO 98/26356, WO 98/29952 and WO 98/35299 disclose how to implement
sequencers having ring- and/or random-access memory means in data processing logic cell fields.

It is to be noted that a task-, thread- and/or hyper-thread switch respectively can be effected with the known
CT-technology such that performance-slices and/or time-slices are assigned to a software implemented
operating system scheduler by the CT, during which slices it is determined which parts of tasks and/or threads
are subsequently to be executed provided that resources are available.

An example is given as follows: First, an address sequence is generated for a first task during which the
execution of a load configuration loads data from a cache memory coupled to the data processing logic
cell field in the described manner. As soon as the data are present, the execution of a second, the actual data
processing configuration can be started. This configuration can be preloaded as well, since it is certain that
this configuration is to be executed provided that no interrupts or the like cause task switches. In
conventional processes there is now known the problem of the so-called cache miss, wherein data are
requested that are not yet available in the cache. If such a case occurs in the coupling according to the present
invention, it is possible to switch over to another thread, hyper-thread and/or task that has in particular been
previously determined as the one to be executed next by the in particular software implemented operating
system scheduler and/or another hard- and/or software implemented unit operating accordingly and has
thus been preloaded in an available configuration memory of the data processing logic cell field, in
particular preloaded in the background during the execution of another configuration, for example the load
configuration which has effected the loading of data that are now waited for.

It is to be noted that it is possible to provide for separate configuration lines, these being e. g. separate from
communication lines used in the connection of the in particular coarse-granular data processing logic cells
of the data processing logic cell field. Then, if the configuration to which, due to the task-, thread- and/or
hyper-thread switch, processing has been switched over has been executed, and in particular has been, in the
preferable non-dividable, uninterruptable and hence quasi atomic configuration, executed until its end, a
further other configuration as predetermined by that scheduler, in particular said operating system-like
scheduler, is executed and/or a configuration for which the assigned load configuration has been executed.
Prior to the execution of a processing configuration for which a load configuration has been executed
previously, a test can be made whether or not the respective data have been streamed into the array, e. g.
checking if the latency time which typically occurs has lapsed and/or the data are actually present.

In other words, latency times which occur as configurations are not yet preloaded, data have not yet been
loaded and/or data have not yet been stored, are bridged and/or covered by executing threads, hyper-threads
and/or tasks which have been preconfigured and which process data that are already available or can be
written to resources that are available for writing thereto. In this way, latency times are covered and/or
bridged and, provided a sufficient number of threads, hyper-threads and/or tasks are to be executed, the
data processing logic cell field will have almost 100 % load.
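
The latency bridging described above can be illustrated by a small selection routine. The following is only a
minimal sketch under stated assumptions: the structure thread_state, its three flags and the round-robin
order are hypothetical and are not taken from the CT or FILMO hardware. It merely shows that a switch
target is a thread whose configuration is preloaded, whose data are present and whose target resource is
writable.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-thread state; the field names are illustrative only. */
typedef struct {
    bool config_preloaded; /* configuration already sits in a configuration memory */
    bool data_ready;       /* input data have been streamed in by a load configuration */
    bool output_free;      /* target resource is available for writing */
} thread_state;

/* Returns true if the thread can run right now without waiting. */
static bool is_runnable(const thread_state *t)
{
    return t->config_preloaded && t->data_ready && t->output_free;
}

/* Pick the next runnable thread after `current` in round-robin order
 * (an assumed policy). Returns its index, or -1 if every thread is stalled. */
static int pick_next_thread(const thread_state *threads, size_t count, size_t current)
{
    for (size_t step = 1; step <= count; ++step) {
        size_t candidate = (current + step) % count;
        if (is_runnable(&threads[candidate])) {
            return (int)candidate;
        }
    }
    return -1;
}
```

With enough threads in the table, the routine returns -1 only rarely, which corresponds to the nearly full
load mentioned above.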

In the present system it is possible to realize a real time system despite the coupling of the array to a
sequential CPU and in particular while still having a data stream capability. In order to ensure real time
capabilities it must be guaranteed that incoming data or interrupts respectively signalizing incoming data are
reacted upon without exceeding an allowed maximum time. This can be effected by causing a task switch
on an interrupt and/or, for example, if the interrupts have a certain priority, by determining that a certain
interrupt is currently to be ignored, which has to be determined within a certain time as well. A task switch
in such systems capable of real time processing will thus typically be possible in one of three ways, namely
either when a task has run for a certain time (watch dog-principle), on non-availability of a resource, be it due
to a blockade due to another access or due to latencies, and/or in the case of the occurrence of interrupts.

A way of implementing one of these variants may ensure the real time capability. In a first alternative, one
resource which is under the control of the CT or scheduler switches over to processing the interrupt. If
the allowed response time to a certain interrupt is so long that the configuration currently configured can be
executed without interruption this is uncritical, in particular in view of the fact that the interrupt handling
configuration can be preloaded. The selection of the interrupt handling configuration to be preloaded can be
carried out by the CT or any other instance. It is also possible to restrict the runtime of the configuration on
the resource to which the interrupt processing has been assigned. Reference is being had to PCT/DE
03/000942.

If the system has to react faster if an interrupt occurs, it can be preferred to reserve a single resource, for
example a separate XPP-unit or parts of a data processing logic cell field, for the execution of interrupt
handling routines. In this case, it is also possible to preload interrupt handling routines for interrupts that
are particularly critical. It is also possible to immediately start loading of an interrupt handling routine once
the interrupt occurs. The selection of the configuration necessary for a respective interrupt can be effected
by triggering, wave-processing and so forth.

It is to be mentioned that by the methods described it becomes possible to provide for an instantaneous
reaction to the interrupt by using load/store configurations in order to obtain a code-reentrancy. Here,
following every single or every other data processing configuration, for example every five or ten data
processing configurations, a store configuration is executed and then a load configuration accessing the
very memory arrays in which data have been just written is carried out. Then, it only has to be made sure
that the memory areas used by that store configuration remain untouched until the configuration or group of
configurations that the preloading has been effected for has been finished by completely executing a further
store configuration. In such a way of intermediately carried out load/store configurations and simultaneous
protection of not yet outdated store-memory areas, code-reentrancy is generated very easily, for example in
compiling a program. Here, resource reservation may be advantageous as well.



Further, a particularly preferred embodiment of a reaction to an interrupt consists in using interrupt routines
where code for the data processing logic cell field is forbidden, this embodiment being preferred when one
of the resources available is a sequential CPU. In other words, an interrupt handling routine is executed
only on a sequential CPU without calling data processing steps or routines making use of a data processing
logic cell field. This guarantees that the processing on the data processing logic cell field is not to be
interrupted and that further processing on that data processing logic cell field can be effected following a task
switch. Although the actual interrupt routine does not comprise any data processing logic cell field code
such as XPP-code, it can still be made sure that at a later time, no more relevant to real time processing
capabilities, the data processing logic cell field reacts to an interrupt and/or a real time request determined,
to state, information and/or data using the data processing logic cell field.

References



Third major part 

Abstract - Nowadays, the datapaths of modern microprocessors reach their limits by using static instruction
sets. A way out of these limitations is a dynamic reconfigurable processor datapath extension achieved by
integrating traditional static datapaths with the coarse-grain dynamic reconfigurable XPP-architecture
(eXtreme Processing Platform). Therefore, a loosely asynchronous coupling mechanism of the corresponding
datapath units has been developed and integrated onto a CMOS 0.13 µm standard cell technology from
UMC. Here the SPARC compatible LEON processor is used, whereas its static pipelined instruction
datapath has been extended to be configured and personalized for specific applications. This allows a
various and efficient use, e. g. in streaming application domains like MPEG-4, digital filters, mobile
communication modulation, etc. The chosen coupling technique allows asynchronous concurrency of the
additionally configured compound instructions, which are integrated into the programming and compilation
environment of the LEON processor.



6.1 Introduction

A Method for Compiling High-Level Language Programs to a Reconfigurable Data-Flow Processor

1 Introduction 

This document describes a method for compiling a subset of a high-level programming language (HLL)
like C or FORTRAN, extended by port access functions, to a reconfigurable data-flow processor (RDFP)
as described in Section 3. The program is transformed to a configuration of the RDFP.

This method can be used as part of an extended compiler for a hybrid architecture consisting of a
host processor and a reconfigurable data-flow coprocessor. The extended compiler handles a full HLL
like standard ANSI C. It maps suitable program parts like inner loops to the coprocessor and the rest
of the program to the host processor. It is also possible to map separate program parts to separate
configurations. However, these extensions are not subject of this document.

2 Compilation Flow 

This section briefly describes the phases of the compilation method.

2.1 Frontend

The compiler uses a standard frontend which translates the input program (e. g. a C program) into an
internal format consisting of an abstract syntax tree (AST) and symbol tables. The frontend also performs
well-known compiler optimizations such as constant propagation, dead code elimination, common
subexpression elimination etc. For details, refer to any compiler construction textbook like [1]. The SUIF
compiler [2] is an example of a compiler providing such a frontend.

2.2 Control/Dataflow Graph Generation

Next, the program is mapped to a control/dataflow graph (CDFG) consisting of connected RDFP
functions. This phase is the main subject of this document and presented in Section 4.

2.3 Configuration Code Generation

Finally, the last phase directly translates the CDFG to configuration code used to program the RDFP. For
PACT XPP™ Cores, the configuration code is generated as an NML (Native Mapping Language) file.

3 Configurable Objects and Functionality of a RDFP

This section describes the configurable objects and functionality of a RDFP. A possible implementation
of the RDFP architecture is a PACT XPP™ Core. Here we only describe the minimum requirements for
a RDFP for this compilation method to work. The only data types considered are multi-bit words called
data and single-bit control signals called events. Data and events are always processed as packets, cf.
Section 3.2. Event packets are called 1-events or 0-events, depending on their bit-value.

3.1 Configurable Objects and Functions

An RDFP consists of an array of configurable objects and a communication network. Each object can
be configured to perform certain functions (listed below). It performs the same function repeatedly until
the configuration is changed. The array need not be completely uniform, i. e. not all objects need to be
able to perform all functions. E. g., a RAM function can be implemented by a specialized RAM object
which cannot perform any other functions. It is also possible to combine several objects to a "macro" to
realize certain functions. Several RAM objects can, e. g., be combined to realize a RAM function with
larger storage.

Figure 1: Functions of an RDFP

The following functions for processing data and event packets can be configured into an RDFP. See Fig. 1
for a graphical representation.

• ALU[opcode]: ALUs perform common arithmetical and logical operations on data. ALU functions
("opcodes") must be available for all operations used in the HLL.¹ ALU functions have two
data inputs A and B, and one data output X. Comparators have an event output U instead of the
data output. They produce a 1-event if the comparison is true, and a 0-event otherwise.

¹ Otherwise programs containing operations which do not have ALU opcodes in the RDFP must be excluded from the
supported HLL subset or substituted by "macros" of existing functions.




• CNT: A counter function which has data inputs LB, UB and INC (lower bound, upper bound
and increment) and data output X (counter value). A packet at event input START starts the
counter, and event input NEXT causes the generation of the next output value (and output events)
or causes the counter to terminate if UB is reached. If NEXT is not connected, the counter counts
continuously. The output events U, V, and W have the following functionality: For a counter
counting N times, N-1 0-events and one 1-event are generated at output U. At output V, N 0-events
are generated, and at output W, N 0-events and one 1-event are created. The 1-event at W is only
created after the counter has terminated, i. e. a NEXT event packet was received after the last data
packet was output.

• RAM[size]: The RAM function stores a fixed number of data words ("size"). It has a data input
RD and a data output OUT for reading at address RD. Event output ERD signals completion of
the read access. For a write access, data inputs WR and IN (address and value) and data output
OUT are used. Event output EWR signals completion of the write access. ERD and EWR always
generate 0-events. Note that external RAM can be handled as RAM functions exactly like internal
RAM.

• GATE: A GATE synchronizes a data packet at input A and an event packet at input E. When
both inputs have arrived, they are both consumed. The data packet is copied to output X, and the
event packet to output U.

• MUX: A MUX function has 2 data inputs A and B, an event input SEL, and a data output X. If
SEL receives a 0-event, input A is copied to output X and input B discarded. For a 1-event, B is
copied and A discarded.

• MERGE: A MERGE function has 2 data inputs A and B, an event input SEL, and a data output X.
If SEL receives a 0-event, input A is copied to output X, but input B is not discarded. The packet
is left at the input B instead. For a 1-event, B is copied and A left at the input.

• DEMUX: A DEMUX function has one data input A, an event input SEL, and two data outputs X
and Y. If SEL receives a 0-event, input A is copied to output X, and no packet is created at output
Y. For a 1-event, A is copied to Y, and no packet is created at output X.

• MDATA: A MDATA function replicates data packets. It has a data input A, an event input
SEL, and a data output X. If SEL receives a 1-event, a data packet at A is consumed and copied
to output X. For all subsequent 0-events at SEL, a copy of the input data packet is produced at the
output without consuming new packets at A. Only if another 1-event arrives at SEL, the next data
packet at A is consumed and copied.²

• INPORT[name]: Receives data packets from outside the RDFP through input port "name" and
copies them to data output X. If a packet was received, a 0-event is produced at event output U,
too. (Note that this function can only be configured at special objects connected to external busses.)

• OUTPORT[name]: Sends data packets received at data input A to the outside of the RDFP through
output port "name". If a packet was sent, a 0-event is produced at event output U, too. (Note that
this function can only be configured at special objects connected to external busses.)
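
The difference between MUX, MERGE and DEMUX lies in which packets are consumed. The following
is a minimal sketch and not the XPP implementation: the type port (a one-slot packet buffer) and the
function names are hypothetical. It assumes that MUX fires only when both data inputs hold a packet, and
that MERGE needs only the selected input.

```c
#include <stdbool.h>

/* One-slot packet buffer; `full` tells whether a packet is waiting. */
typedef struct {
    bool full;
    int  value;
} port;

/* MUX: consumes SEL and both data inputs; the unselected packet is discarded.
 * Returns true if the function fired. */
static bool mux_fire(port *a, port *b, port *sel, port *x)
{
    if (!a->full || !b->full || !sel->full || x->full) return false;
    x->value = sel->value ? b->value : a->value;
    x->full = true;
    a->full = b->full = sel->full = false;
    return true;
}

/* MERGE: consumes SEL and only the selected data input; the other packet
 * is left at its input. */
static bool merge_fire(port *a, port *b, port *sel, port *x)
{
    if (!sel->full || x->full) return false;
    port *chosen = sel->value ? b : a;
    if (!chosen->full) return false;
    x->value = chosen->value;
    x->full = true;
    chosen->full = false;
    sel->full = false;
    return true;
}

/* DEMUX: routes A to X (0-event) or Y (1-event); no packet is created at
 * the other output. */
static bool demux_fire(port *a, port *sel, port *x, port *y)
{
    if (!a->full || !sel->full) return false;
    port *target = sel->value ? y : x;
    if (target->full) return false;
    target->value = a->value;
    target->full = true;
    a->full = sel->full = false;
    return true;
}
```

After a MUX fires with a 0-event, both A and B are empty; after a MERGE fires with the same event, B
still holds its packet.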

² Note that MDATA can be implemented by a MERGE with special properties on XPP™.

Additionally, the following functions manipulate only event packets:

• 0-FILTER, 1-FILTER: A FILTER has an input E and an output U. A 0-FILTER copies a 0-event
from E to U, but 1-events at E are discarded. A 1-FILTER copies 1-events and discards 0-events.

• INVERTER: Copies all events from input E to output U but inverts its value.

• 0-CONSTANT, 1-CONSTANT: 0-CONSTANT copies all events from input E to output U, but
changes them all to value 0. 1-CONSTANT changes all to value 1.

• ECOMB: Combines two or more inputs E1, E2, E3..., producing a packet at output U. The output
is a 1-event if and only if one or more of the input packets are 1-events (logical or). A packet must
be available at all inputs before an output packet is produced.³

• ESEQ[seq]: An ESEQ generates a sequence "seq" of events, e. g. "0001", at its output U. If it
has an input START, one entire sequence is generated for each event packet arriving at START. The
sequence is only repeated if the next event arrives at START. However, if START is not connected,
ESEQ constantly repeats the sequence.

Note that ALU, MUX, DEMUX, GATE and ECOMB functions behave like their equivalents in classical
dataflow machines [3, 4].
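
The event-only functions above can be modelled as small pure functions on single-bit packets. This is only
an illustrative sketch: the type event and all function names are hypothetical, and an absent packet is
modelled by a cleared present flag rather than by real handshake signals.

```c
#include <stdbool.h>

/* An event packet is a single bit; `present` models whether a packet exists. */
typedef struct {
    bool present;
    bool bit;
} event;

static event no_event(void)
{
    event e = { false, false };
    return e;
}

/* k-FILTER: passes events whose bit equals `keep`, discards the others. */
static event filter(event in, bool keep)
{
    if (in.present && in.bit == keep) return in;
    return no_event();
}

/* INVERTER: passes every event with its bit flipped. */
static event inverter(event in)
{
    if (in.present) in.bit = !in.bit;
    return in;
}

/* k-CONSTANT: passes every event but forces its bit to `k`. */
static event constant(event in, bool k)
{
    if (in.present) in.bit = k;
    return in;
}

/* ECOMB: waits for a packet on all n inputs, then emits their logical OR. */
static event ecomb(const event *in, int n)
{
    event out = { true, false };
    for (int i = 0; i < n; ++i) {
        if (!in[i].present) return no_event();
        out.bit = out.bit || in[i].bit;
    }
    return out;
}
```

For instance, filter(e, false) behaves as a 0-FILTER and constant(e, true) as a 1-CONSTANT.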

3.2 Packet-based Communication Network

The communication network of an RDFP can connect an output of one object (i. e. its respective
function) to the input(s) of one or several other objects. This is usually achieved by busses and switches. By
placing the functions properly on the objects, many functions can be connected arbitrarily up to a limit
imposed by the device size. As mentioned above, all values are communicated as packets. A separate
communication network exists for data and event packets. The packets synchronize the functions as in a
dataflow machine with acknowledge [3], i. e., the function only executes when all input packets are
available (apart from the non-strict exceptions as described above). The function also stalls if the last output
packet has not been consumed. Therefore a data-flow graph mapped to an RDFP self-synchronizes its
execution without the need for external control. Only if two or more function outputs (data or event) are
connected to the same function input ("N to 1 connection"), the self-synchronization is disabled.⁴ The
user has to ensure that only one packet arrives at a time in a correct CDFG. Otherwise a packet might
get lost, and the value resulting from combining two or more packets is undefined. However, a function
output can be connected to many function inputs ("1 to N connection") without problems.

There are some special cases: 

• A function input can be preloaded with a distinct value during configuration. This packet is
consumed like a normal packet coming from another object.

• A function input can be defined as constant. In this case, the packet at the input is reproduced
repeatedly for each function execution.
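
The firing rule and the two special cases can be sketched as follows. This is a minimal model under stated
assumptions, not a description of the actual hardware handshake: input_reg, output_reg, preload,
set_constant and add_fire are hypothetical names, and a two-input adder stands in for an arbitrary
function.

```c
#include <stdbool.h>

/* Input register of a function. `constant` marks an input whose packet is
 * reproduced for every execution instead of being consumed. */
typedef struct {
    bool full;
    bool constant;
    int  value;
} input_reg;

typedef struct {
    bool full;
    int  value;
} output_reg;

/* Configuration-time helper: the preloaded packet is consumed once. */
static void preload(input_reg *in, int value)
{
    in->full = true;
    in->constant = false;
    in->value = value;
}

/* Configuration-time helper: the packet stays available for every execution. */
static void set_constant(input_reg *in, int value)
{
    in->full = true;
    in->constant = true;
    in->value = value;
}

/* Fire a two-input adder if both inputs hold a packet and the previous
 * output packet has been consumed; otherwise stall and return false. */
static bool add_fire(input_reg *a, input_reg *b, output_reg *x)
{
    if (!a->full || !b->full || x->full) return false;
    x->value = a->value + b->value;
    x->full = true;
    if (!a->constant) a->full = false;
    if (!b->constant) b->full = false;
    return true;
}
```

A preloaded input lets the function fire once right after configuration, while a constant input never
causes a stall.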

³ Note that this function is implemented by the EAND operator on the XPP™.

⁴ Note that on XPP™ Cores, an "N to 1 connection" for events is realized by the EOR function, and for data by just
assigning several outputs to an input.

An RDFP requires register delays in the dataflow. Otherwise very long combinational delays and
asynchronous feedback is possible. We assume that delays are inserted at the inputs of some functions (like
for most ALUs) and in some routing segments of the communication network. Note that registers change
the timing, but not the functionality of a correct CDFG.



4 Configuration Generation 

4.1 Language Definition

The following HLL features are not supported by the method described here:

• pointer operations

• library calls, operating system calls (including standard I/O functions)

• recursive function calls (Note that non-recursive function calls can be eliminated by function
inlining and therefore are not considered here.)

• All scalar data types are converted to type integer. Integer values are equivalent to data packets in
the RDFP. Arrays (possibly multi-dimensional) are the only composite data types considered.

The following additional features are supported:

• INPORTS and OUTPORTS can be accessed by the HLL functions getstream(name, value) and
putstream(name, value) respectively.

4.2 Mapping of High-Level Language Constructs

This method converts a HLL program to a CDFG consisting of the RDFP functions defined in Section 3.1.
Before the processing starts, all HLL program arrays are mapped to RDFP RAM functions. An array x
is mapped to RAM RAM(x). If several arrays are mapped to the same RAM, an offset is assigned, too.
The RAMs are added to an initially empty CDFG. There must be enough RAMs of sufficient size for all
program arrays.
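
The array mapping step can be pictured with a small allocator. The text does not prescribe a packing
strategy, so the sequential packing below is purely an assumption for illustration, and ram_allocator,
array_map and map_array are hypothetical names. An element x[i] would then be addressed at
offset + i within RAM(x).

```c
#include <stddef.h>

#define MAX_ARRAYS 16

/* Where one HLL array lives: which RAM function and at what word offset. */
typedef struct {
    const char *name;
    int    ram;     /* index of the RAM function, i.e. RAM(x) */
    size_t offset;  /* first word of the array inside that RAM */
    size_t length;  /* number of words */
} array_map;

typedef struct {
    size_t ram_size;   /* words per RAM function */
    int    ram_count;  /* number of RAM functions available */
    int    cur_ram;    /* RAM currently being filled */
    size_t cur_used;   /* words already used in cur_ram */
    array_map maps[MAX_ARRAYS];
    int    map_count;
} ram_allocator;

/* Sequential packing (an assumed policy): place the array in the current RAM
 * if it fits, otherwise continue with the next RAM. Returns NULL if there are
 * not enough RAMs of sufficient size. */
static const array_map *map_array(ram_allocator *al, const char *name, size_t length)
{
    if (length > al->ram_size || al->map_count >= MAX_ARRAYS) return NULL;
    if (al->cur_used + length > al->ram_size) {
        al->cur_ram++;
        al->cur_used = 0;
    }
    if (al->cur_ram >= al->ram_count) return NULL;
    array_map *m = &al->maps[al->map_count++];
    m->name = name;
    m->ram = al->cur_ram;
    m->offset = al->cur_used;
    m->length = length;
    al->cur_used += length;
    return m;
}
```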

The CDFG is generated by a traversal of the AST of the HLL program. It processes the program
statement by statement and descends into the loops and conditional statements as appropriate. The following
two pieces of information are updated at every program point⁵ during the traversal:

• START points to an event output of a RDFP function. This output delivers a 0-event whenever
the program execution reaches this program point. At the beginning, a 0-CONSTANT preloaded
with an event input is added to the CDFG. (It delivers a 0-event immediately after configuration.)
START initially points to its output. This event is used to start the overall program execution. The
STARTnew signal generated after a program part has finished executing is used as new START
signal for the following program parts, or it signals termination of the entire program. The START
events guarantee that the execution order of the original program is maintained wherever the data
dependencies alone are not sufficient. This scheduling scheme is similar to a one-hot controller
for digital hardware.

⁵ In a program, program points are between two statements or before the beginning or after the end of a program
component like a loop or a conditional statement.

• VARLIST is a list of {variable, function-output} pairs. The pairs map integer variables or array
elements to a CDFG function's output. The first pair for a variable in VARLIST contains the
output of the function which produces the value of this variable valid at the current program point.
New pairs are always added to the front of VARLIST. The expression VARDEF(var) refers to the
function-output of the first pair with variable var in VARLIST.⁶

The following subsections systematically list all HLL program components and describe how they are
processed, thereby altering the CDFG, START and VARLIST.

4.2.1 Integer Expressions and Assignments

Straight-line code without array accesses can be directly mapped to a data-flow graph. One ALU is
allocated for each operator in the program. Because of the self-synchronization of the ALUs, no explicit
control or scheduling is needed. Therefore processing these assignments does not access or alter START.
The data dependences (as they would be exposed in the DAG representation of the program [1]) are
analyzed through the processing of VARLIST. These assignments synchronize themselves through the
data-flow. The data-driven execution automatically exploits the available instruction level parallelism.

All assignments evaluate the right-hand side (RHS) or source expression. This evaluation results in a 
pointer to a CDFG object^s output (or pseudo-object as defined below). For integer assignments^ the 
left-hand dde (LHS) variable or destination is combined with die RHS residt objea to form a new pair 
{I^S.iesuh(RHS)}wMcli is added to die front of VAJRIJST. . 

The simplest statement is a constant assigned to an integer:

a = 5;

It doesn't change the CDFG, but adds {a, 5} to the front of VARLIST. The constant 5 is a
"pseudo-object" which only holds the value, but does not refer to a CDFG object. Now VARDEF(a)
equals 5 at subsequent program points before a is redefined.

Integer assignments can also combine variables already defined and constants:

b = a * 2 + 3;

In the AST, the RHS is already converted to an expression tree. This tree is transformed to a combination
of old and new CDFG objects (which are added to the CDFG) as follows: Each operator (internal node)
of the tree is substituted by an ALU with the opcode corresponding to the operator in the tree. If a leaf
node is a constant, the ALU's input is directly connected to that constant. If a leaf node is an integer
variable var, it is looked up in VARLIST, i. e. VARDEF(var) is retrieved. Then VARDEF(var) (an output
of an already existing object in the CDFG or a constant) is connected to the ALU's input. The output of the
ALU corresponding to the root operator in the expression tree is defined as the result of the RHS. Finally,
a new pair {LHS, result(RHS)} is added to VARLIST. If the two assignments above are processed, the
CDFG with two ALUs in Fig. 2 is created. Outputs occurring in VARLIST are labeled by Roman
numbers. After these two assignments, VARLIST = [{b, I}, {a, 5}]. (The front of the list is on the left
side.) Note that all inputs connected to a constant (whether direct from the expression tree or retrieved
from VARLIST) must be defined as constant. Inputs defined as constants have a small c next to the input
arrow in Fig. 2.

* This method of using a VARLIST is adapted from the Transmogrifier C compiler [5].
* Note that we use C syntax for the following examples.
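The processing of the two assignments above can be sketched in a few lines of Python. This is only an illustrative model, not the compiler's implementation: the class Alu, the tuple encoding of expression trees and the function names lower and assign are assumptions made for this sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Alu:
    """Hypothetical CDFG object: one ALU with an opcode and two input sources."""
    opcode: str
    left: object    # output of another Alu, or an int constant (pseudo-object)
    right: object

cdfg = []       # ALUs added to the CDFG, in creation order
varlist = []    # (variable, function-output) pairs; front of the list is index 0

def vardef(var):
    """VARDEF(var): the output stored in the first pair for var."""
    for name, output in varlist:
        if name == var:
            return output
    raise KeyError(var)

def lower(expr):
    """Map an expression tree to CDFG objects and return the result output.

    expr is an int (constant leaf), a str (variable leaf),
    or a tuple (operator, left subtree, right subtree).
    """
    if isinstance(expr, int):
        return expr                         # constant: connected directly
    if isinstance(expr, str):
        return vardef(expr)                 # variable: looked up in VARLIST
    op, lhs, rhs = expr
    alu = Alu(op, lower(lhs), lower(rhs))   # one ALU per operator node
    cdfg.append(alu)
    return alu

def assign(lhs, expr):
    """Process 'lhs = expr;' by adding {lhs, result(RHS)} to the front of VARLIST."""
    varlist.insert(0, (lhs, lower(expr)))

assign("a", 5)                              # a = 5;
assign("b", ("+", ("*", "a", 2), 3))        # b = a * 2 + 3;
print(len(cdfg), [name for name, _ in varlist])
```

In this model the first assignment adds no ALU, and the second one adds a multiplier and an adder whose constant inputs are 5, 2 and 3, which corresponds to the two ALUs and VARLIST = [{b, I}, {a, 5}] described above.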

4.2.2 Conditional Integer Assignments

For conditional if-then-else statements containing only integer assignments, objects for condition
evaluation are created first. The object event output indicating the condition result is kept for choosing
the correct branch result later. Next, both branches are processed in parallel, using separate copies
VARLIST1 and VARLIST2 of VARLIST. (VARLIST itself is not changed.) Finally, for all variables
added to VARLIST1 or VARLIST2, a new entry for VARLIST is created (combination phase). The valid
definitions from VARLIST1 and VARLIST2 are combined with a MUX function, and the correct input
is selected by the condition result. For variables only defined in one of the two branches, the multiplexer
uses the result retrieved from the original VARLIST for the other branch. If the original VARLIST does
not have an entry for this variable, a special "undefined" constant value is used. However, in a functionally
correct program this value will never be used. As an optimization, only variables live [1] after the
if-then-else structure need to be added to VARLIST in the combination phase.

Consider the following example:

i = 7;
a = 3;
if (i < 10) {
    a = 5;
    c = 7;
}
else {
    c = a - 1;
    d = 0;
}

Fig. 3 shows the resulting CDFG. Before the if-then-else construct, VARLIST = [{a, 3}, {i, 7}]. After
processing the branches, for the then branch, VARLIST1 = [{c, 7}, {a, 5}, {a, 3}, {i, 7}], and for the
else branch, VARLIST2 = [{d, 0}, {c, I}, {a, 3}, {i, 7}]. After combination, VARLIST = [{d, II}, {c,
III}, {a, IV}, {a, 3}, {i, 7}].
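The combination phase for this example can be sketched as follows. The sketch is an assumption-laden model, not the compiler's code: the tuple ("MUX", condition, then-value, else-value), the sentinel UNDEFINED and the function names are invented here, and the order in which new pairs are added was chosen only so that it reproduces the list given above.

```python
UNDEFINED = "undefined"   # stand-in for the special "undefined" constant

def first_def(varlist, var):
    """Return the output of the first pair for var, or UNDEFINED if there is none."""
    for name, output in varlist:
        if name == var:
            return output
    return UNDEFINED

def combine(varlist, varlist1, varlist2, cond):
    """Combination phase: return the VARLIST valid after the if-then-else."""
    base = len(varlist)
    # Pairs added by a branch sit in front of the unchanged original pairs.
    # Reversing restores the order in which the branch defined them.
    added1 = [v for v, _ in reversed(varlist1[:len(varlist1) - base])]
    added2 = [v for v, _ in reversed(varlist2[:len(varlist2) - base])]
    result = list(varlist)
    for var in dict.fromkeys(added1 + added2):       # each variable once
        mux = ("MUX", cond, first_def(varlist1, var), first_def(varlist2, var))
        result.insert(0, (var, mux))
    return result

varlist = [("a", 3), ("i", 7)]
varlist1 = [("c", 7), ("a", 5)] + varlist    # then branch
varlist2 = [("d", 0), ("c", "I")] + varlist  # else branch; "I" names the output of a - 1
combined = combine(varlist, varlist1, varlist2, "i<10")
print([name for name, _ in combined])
```

Because each branch copy still contains the original pairs, a variable defined in only one branch automatically picks up its original definition for the other MUX input, and d, which has no earlier definition, receives the undefined value on the then side.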

Note that case- or switch-statements can be processed, too, since they can - without loss of generality -
be converted to nested if-then-else statements.

Processing conditional statements this way does not require explicit control and does not change START.
Both branches are executed in parallel and synchronized by the data-flow. It is possible to pipeline the
dataflow for optimal throughput.

* Note that the input and output names can be deduced from their position, cf. Fig. 1. Also note that the
compiler frontend would normally have substituted the second assignment by b = 13 (constant
propagation). For the simplicity of this explanation, no frontend optimizations are considered in this
and the following examples.

* Definition: A variable is live at a program point if its value is read at a statement reachable from here
without intermediate redefinition.

4.2.3 General Conditional Statements

Conditional statements containing either array accesses (cf. Section 4.2.7 below) or inner loops cannot
be processed as described in Section 4.2.2. Data packets must only be sent to the active branch. This is
achieved by the implementation shown in Fig. 8, similar to the method presented in [4].

A dataflow analysis is performed to compute used sets use and defined sets def [1] of both branches.
For the current VARLIST entries of all variables in IN = use(thenbody) ∪ def(thenbody) ∪
use(elsebody) ∪ def(elsebody) ∪ use(header), DEMUX functions controlled by the IF condition are
inserted. Note that arrows with double lines in Fig. 8 denote connections for all variables in IN, and the
shaded DEMUX function stands for several DEMUX functions, one for each variable in IN. The
DEMUX functions forward data packets only to the selected branch. New lists VARLIST1 and VARLIST2
are compiled with the respective outputs of these DEMUX functions. The then-branch is processed with
VARLIST1, and the else branch with VARLIST2. Finally, the output values are combined. OUT
contains the new values for the same variables as in IN. Since only one branch is ever activated there will not
be a conflict due to two packets arriving simultaneously. The combinations will be added to VARLIST
after the conditional statement. If the IF execution shall be pipelined, MERGE opcodes for the output
must be inserted, too. They are controlled by the condition like the DEMUX functions.

The following extension with respect to [4] is added (dotted lines in Fig. 8) in order to control the
execution as mentioned above with START events: The START input is ECOMB-combined with the condition
output and connected to the SEL input of the DEMUX functions. The START inputs of thenbody and
elsebody are generated from the ECOMB output sent through a 1-FILTER and a 0-CONSTANT or
through a 0-FILTER, respectively. The overall STARTnew output is generated by a simple "2 to 1
connection" of thenbody's and elsebody's STARTnew outputs. With this extension, arbitrarily nested
conditional statements or loops can be handled within thenbody and elsebody.
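As a rough illustration of the set IN and of the DEMUX behaviour, consider the following Python sketch. The use and def sets are hypothetical and do not come from one of the figures, and the convention that a true condition selects the then branch is an assumption of this sketch.

```python
def in_set(use_then, def_then, use_else, def_else, use_header):
    """IN = use(thenbody) U def(thenbody) U use(elsebody) U def(elsebody) U use(header)."""
    return use_then | def_then | use_else | def_else | use_header

def demux(sel, packet):
    """Forward a data packet to exactly one branch: returns (to_then, to_else)."""
    return (packet, None) if sel else (None, packet)

# Hypothetical sets for some conditional statement.
IN = in_set(use_then={"x", "i"}, def_then={"s"},
            use_else={"y"}, def_else={"y"},
            use_header={"x", "n"})

values = {"x": 1, "i": 2, "s": 3, "y": 4, "n": 5}   # current VARLIST values
condition = values["x"] < values["n"]               # IF condition, here true

# One DEMUX per variable in IN, all controlled by the same condition.
routed = {var: demux(condition, values[var]) for var in sorted(IN)}
then_inputs = {v: p[0] for v, p in routed.items() if p[0] is not None}
else_inputs = {v: p[1] for v, p in routed.items() if p[1] is not None}
print(then_inputs, else_inputs)
```

Since the inactive branch receives no packets at all, its operators never fire, which is what allows array accesses and inner loops inside the branches.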

4.2.4 WHILE Loops

WHILE loops are processed similarly to the scheme presented in [4], cf. Fig. 9. As in Section 4.2.3,
double line connections and shaded MERGE and DEMUX functions represent duplication for all variables
in IN. Here IN = use(whilebody) ∪ def(whilebody) ∪ use(header). The WHILE loop executes as
follows: In the first loop iteration, the MERGE functions select all input values from VARLIST at loop
entry (SEL=0). The MERGE outputs are connected to the header and the DEMUX functions. If the
while condition is true (SEL=1), the input values are forwarded to the whilebody, otherwise to OUT.
The output values of the while body are fed back to whilebody's input via the MERGE and DEMUX
operators as long as the condition is true. Finally, after the last iteration, they are forwarded to OUT. The
outputs are added to the new VARLIST.

Two extensions with respect to [4] are added (dotted lines in Fig. 9):

• In [4], the SEL input of the MERGE functions is preloaded with 0. Hence the loop execution
begins immediately and can be executed only once. Instead, we connect the START input to the
MERGE's SEL input ("2 to 1 connection" with the header output). This makes it possible to control
the start of the loop execution and to restart it.

• The whilebody's START input is connected to the header output, sent through a 1-FILTER/0-CONSTANT
combination as above (generates a 0-event for each loop iteration). By ECOMB-combining
whilebody's STARTnew output with the header output for the MERGE functions'
SEL inputs, the next loop iteration is only started after the previous one has finished. The while
loop's STARTnew output is generated by filtering the header output for a 0-event.

With these extensions, arbitrarily nested conditional statements or loops can be handled within
whilebody.

* A variable is used in a statement (and hence in a program region containing this statement) if its value
is read. A variable is defined in a statement (or region) if a new value is assigned to it.

* The 0-CONSTANT is required since START events must always be 0-events.

* Note that the MERGE function for variables not live at the loop's beginning and the whilebody's
beginning can be removed since its output is not used. For these variables, only the DEMUX function to
output the final value is required. Also note that the MERGE functions can be replaced by simple "2 to 1
connections" if the configuration process guarantees that packets from IN1 always arrive at the DEMUX's
input before feedback values arrive.
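The circulation of values through MERGE, header and DEMUX can be mimicked by a small functional simulation. This Python sketch ignores events, timing and START control entirely; the function run_while and the example loop are assumptions chosen for illustration.

```python
def run_while(initial, header, body):
    """Simulate the MERGE/header/DEMUX cycle for all variables in IN.

    initial: dict with the values from VARLIST at loop entry
    header:  function(values) -> bool, the while condition
    body:    function(values) -> dict with the values after one iteration
    Returns (values forwarded to OUT, number of iterations).
    """
    merged = dict(initial)          # first pass: MERGE selects the entry values
    iterations = 0
    while True:
        sel = header(merged)        # header output drives the DEMUX SEL inputs
        if not sel:
            return merged, iterations   # condition false: forward to OUT
        merged = body(merged)       # condition true: forward to whilebody,
        iterations += 1             # then feed the results back via MERGE

# Hypothetical loop:  while (i < 4) { s = s + i; i = i + 1; }
out, count = run_while(
    {"s": 0, "i": 0},
    header=lambda v: v["i"] < 4,
    body=lambda v: {"s": v["s"] + v["i"], "i": v["i"] + 1},
)
print(out, count)
```

The dictionary plays the role of the double-line connections in Fig. 9: every variable in IN travels through its own MERGE and DEMUX, but all of them are steered by the same header result.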

4.2.5 FOR Loops

FOR loops are particularly regular WHILE loops. Therefore we could handle them as explained above.
However, our RDFP features the special counter function CNT and the data packet multiplication
function MDATA which can be used for a more efficient implementation of FOR loops. This new FOR loop
scheme is shown in Fig. 10.

A FOR loop is controlled by a counter CNT. The lower bound (LB), upper bound (UB), and increment
(INC) expressions are evaluated like any other expressions (see Sections 4.2.1 and 4.2.7) and connected
to the respective inputs.

As opposed to WHILE loops, a MERGE/DEMUX combination is only required for variables in IN1 =
def(forbody), i. e. those defined in forbody. IN1 does not contain variables which are only used
in forbody, LB, UB, or INC, and does also not contain the loop index variable. Variables in IN1 are
processed as in WHILE loops, but the MERGE and DEMUX functions' SEL input is connected to
CNT's W output. (The W output does the inverse of a WHILE loop's header output; it outputs a
1-event after the counter has terminated. Therefore the inputs of the MERGE functions and the outputs
of the DEMUX functions are swapped here, and the MERGE functions' SEL inputs are preloaded with
1-events.)

CNT's X output provides the current value of the loop index variable. If the final index value is required
(live) after the FOR loop, it is selected with a DEMUX function controlled by CNT's U event output
(which produces one event for every loop iteration).

Variables in IN2 = use(forbody) \ def(forbody), i. e. those defined outside the loop and only used
(but not redefined) inside the loop are handled differently. Unless it is a constant value, the variable's
input value (from VARLIST) must be reproduced in each loop iteration since it is consumed in each
iteration. Otherwise the loop would stall from the second iteration onwards. The packets are reproduced
by MDATA functions, with the SEL inputs connected to CNT's U output. The SEL inputs must be
preloaded with a 1-event to select the first input. The 1-event provided by the last iteration selects a new
value for the next execution of the entire loop.
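The split into IN1 and IN2 can be illustrated with the loop body a = a + i; x[i] = k; from the first example in Section 4.3. In the following Python sketch, excluding the loop index explicitly and leaving the array x out of the sets are modelling assumptions, and mdata is only a list-based stand-in for the MDATA function.

```python
def classify(use_body, def_body, index_var):
    """Return (IN1, IN2) for a FOR loop body; the loop index comes from CNT."""
    in1 = def_body - {index_var}
    in2 = (use_body - def_body) - {index_var}
    return in1, in2

def mdata(value, u_events):
    """Reproduce an input packet once per loop iteration (one copy per U event)."""
    return [value for _ in u_events]

# forbody:  a = a + i;  x[i] = k;
in1, in2 = classify(use_body={"a", "i", "k"}, def_body={"a"}, index_var="i")

# CNT's U output for i = 0..10: one event per iteration, only the last is a 1-event.
u_events = [0] * 10 + [1]
k_packets = mdata(7, u_events)     # 7 is an arbitrary value for k
print(in1, in2, len(k_packets))
```

Only a needs a MERGE/DEMUX feedback path, whereas k is merely replicated eleven times, once for each iteration that consumes it.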

* Note that the MERGE functions can be replaced by simple "2 to 1 connections" as for WHILE loops if the
configuration process guarantees that packets from IN1 always arrive at the DEMUX's input before
feedback values arrive.

The following control events (dotted lines in Fig. 10) are similar to the WHILE loop extensions, but
simpler. CNT's START input is connected to the loop's overall START signal. STARTnew is generated
from CNT's W output, sent through a 1-FILTER and 0-CONSTANT. CNT's V output produces one
0-event for each loop iteration and is therefore used as forbody's START. Finally, CNT's NEXT input is
connected to forbody's STARTnew output.

For pipelined loops (as defined below in Section 4.2.6), loop iterations are allowed to overlap. Therefore
CNT's NEXT input need not be connected. Now the counter produces index variable values and control
events as fast as they can be consumed. However, in this case CNT's W output is not sufficient as overall
STARTnew output since the counter terminates before the last iteration's forbody finishes. Instead,
STARTnew is generated from CNT's U output ECOMB-combined with forbody's STARTnew output,
sent through a 1-FILTER/0-CONSTANT combination. The ECOMB produces an event after termination
of each loop iteration, but only the last event is a 1-event because only the last output of CNT's U output
is a 1-event. Hence this event indicates that the last iteration has finished. Cf. Section 4.3 for a FOR loop
example compilation with and without pipelining.

As for WHILE loops, these methods allow arbitrarily nested loops and conditional statements to be processed.
The following advantages over WHILE loop implementations are achieved:

• One index variable value is generated by the CNT function each clock cycle. This is faster and
smaller than the WHILE loop implementation which allocates a MERGE/DEMUX/ADD loop and
a comparator for the counter functionality.

• Variables in IN2 (only used in forbody) are reproduced in the special MDATA functions and need
not go through a MERGE/DEMUX loop. This is again faster and smaller than the WHILE loop
implementation.

4.2.6 Vectorization and Pipelining

The method described so far generates CDFGs performing the HLL program's functionality on an RDFP.
However, the program execution is unduly sequentialized by the START signals. In some cases,
innermost loops can be vectorized. This means that loop iterations can overlap, leading to a pipelined dataflow
through the operators of the loop body. The Pipeline Vectorization technique [6] can be easily applied to
the compilation method presented here. As mentioned above, for FOR loops, the CNT's NEXT input is
removed so that CNT counts continuously, thereby overlapping the loop iterations.

All loops without array accesses can be pipelined since the dataflow automatically synchronizes
loop-carried dependences, i. e. dependences between a statement in one iteration and another statement in a
subsequent iteration. Loops with array accesses can be pipelined if the array (i. e. RAM) accesses do
not cause loop-carried dependences or can be transformed to such a form. In this case no RAM address
is written in one and read in a subsequent iteration. Therefore the read and write accesses to the same
RAM may overlap. This degree of freedom is exploited in the RAM access technique described below.
Especially for dual-ported RAM it leads to considerable performance improvements.

4.2.7 Array Accesses

In contrast to scalar variables, array accesses have to be controlled explicitly in order to maintain the
program's correct execution order. As opposed to normal dataflow machine models [3], a RDFP does
not have a single address space. Instead, the arrays are allocated to several RAMs. This leads to a
different approach to handling RAM accesses and opens up new optimization opportunities.

To reduce the complexity of the compilation process, array accesses are processed in two phases. Phase 1
uses "pseudo-functions" for RAM read and write accesses. A RAM read function has a RD data input
(read address) and an OUT data output (read value), and a RAM write function has WR and IN data
inputs (write address and write value). Both functions are labeled with the array the access refers to, and
both have a START event input and a U event output. The events control the access order. In Phase 2 all
accesses to the same RAM are combined and substituted by a single RAM function as shown in Fig. 1.
This involves manipulating the data and event inputs and outputs such that the correct execution order is
maintained and the outputs are forwarded to the correct part of the CDFG.

Phase 1 Since arrays are allocated to several RAMs, only accesses to the same RAM have to be
synchronized. Accesses to different RAMs can occur concurrently or even out of order. In case of data
dependencies, the accesses self-synchronize automatically. Within pipelined loops, not even read and
write accesses to the same RAM have to be synchronized. This is achieved by maintaining separate
START signals for every RAM or even separate START signals for RAM read and RAM write accesses
in pipelined loops. At the end of a basic block [1], all STARTnew outputs must be combined by an
ECOMB to provide a START signal for the next basic block which guarantees that all array accesses in
the previous basic block are completed. For pipelined loops, this condition can even be relaxed. Only
after the loop exit all accesses have to be completed. The individual loop iterations need not be
synchronized.

First the RAM addresses are computed. The compiler frontend's standard transformation for array
accesses can be used, and a CDFG function's output is generated which provides the address. If applicable,
the offset with respect to the RDFP RAM (as determined in the initial mapping phase) must be added.
This output is connected to the pseudo RAM read's RD input (for a read access) or to the pseudo RAM
write's WR input (for a write access). Additionally, the OUT output (read) or IN input (write) is
connected. The START input is connected to the variable's START signal, and the U output is used as
STARTnew for the next access.

To avoid redundant read accesses, RAM reads are also registered in VARLIST. Instead of an integer
variable, an array element is used as first element of the pair. However, a change in a variable occurring
in an array index invalidates the information in VARLIST. It must then be removed from it.
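A minimal Python sketch of this bookkeeping is given below. It assumes that an array index is a single variable name, represents an array element by the tuple (array, index variable), and uses invented output names; none of this is prescribed by the method itself.

```python
import itertools

varlist = []                    # front of the list is index 0
_ids = itertools.count(1)       # used only to invent distinct output names

def read_array(array, index_var):
    """Return (output, is_new) for a read of array[index_var]."""
    key = (array, index_var)
    for k, out in varlist:
        if k == key:
            return out, False               # registered read: reuse its output
    out = f"{array}_read{next(_ids)}"       # hypothetical pseudo-function output
    varlist.insert(0, (key, out))
    return out, True

def assign_scalar(var, output):
    """Redefine var and remove array elements whose index uses var."""
    varlist[:] = [(k, o) for k, o in varlist
                  if not (isinstance(k, tuple) and k[1] == var)]
    varlist.insert(0, (var, output))

x, new_x = read_array("a", "i")     # x = a[i];   creates a RAM read
y, new_y = read_array("a", "i")     # y = a[i];   reuses the registered read
assign_scalar("i", "i_plus_1")      # i = i + 1;  invalidates the entry for a[i]
z, new_z = read_array("a", "i")     # z = a[i];   must read the RAM again
print(new_x, new_y, new_z)
```

A complete implementation would have to handle general index expressions and any other event that makes a registered read stale; the sketch covers only the index-variable case described above.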
The following example with two read accesses compiles to the intermediate CDFG shown in Fig. 12. The
START signals refer only to variable a. STOP1 is the event connection which synchronizes the accesses.
Inputs START (old), i and j should be substituted by the actual outputs resulting from the program before
the array reads.

x = a[i];
y = a[j];
z = x + y;

Fig. 13 shows the translation of the following write access:

a[i] = x;

* A basic block is a program part with a single entry and a single exit point, i. e. a piece of straight-line code.

Phase 2 We now merge the pseudo-functions of all accesses to the same RAM and substitute them by
a single RAM function. For all data inputs (RD for read access and WR and IN for write access), GATEs
are inserted between the input and the RAM function. Their E inputs are connected to the respective
START inputs of the original pseudo-functions. If a RAM is read and written at only one program point,
the U output of the read and write access is moved to the ERD or EWR output, respectively. For example,
the single access a[i] = x; from Fig. 13 is transformed to the final CDFG shown in Fig. 5.

However, if several read or several write accesses (i. e. pseudo-functions from different program points)
to the same RAM occur, the ERD or EWR events are not specific anymore. But a STARTnew event of
the original pseudo-function should only be generated for the respective program point, i. e. for the
current access. This is achieved by connecting the START signals of all other accesses (pseudo-functions)
of the same type (read or write) with the inverted START signal of the current access. The resulting
signal produces an event for every access, but only for the current access a 1-event. This event is
ECOMB-combined with the RAM's ERD or EWR output. The ECOMB's output will only occur after
the access is completed. Because ECOMB OR-combines its event packets, only the current access
produces a 1-event. Next, this event is filtered with a 1-FILTER and changed by a 0-CONSTANT, resulting
in a STARTnew signal which produces a 0-event only after the current access is completed as required.

For several accesses, several sources are connected to the RD, WR and IN inputs of a RAM. This disables
the self-synchronization. However, since only one access occurs at a time, the GATEs only allow one
data packet to arrive at the inputs.

For read accesses, the packets at the OUT output face the same problem as the ERD event packets:
They occur for every read access, but must only be used (and forwarded to subsequent operators) for
the current access. This can be achieved by connecting the OUT output via a DEMUX function. The Y
output of the DEMUX is used, and the X output is left unconnected. Then it acts as a selective gate which
only forwards packets if its SEL input receives a 1-event, and discards its data input if SEL receives a
0-event. The signal created by the ECOMB described above for the STARTnew signal creates a 1-event
for the current access, and a 0-event otherwise. Using it as the SEL input achieves exactly the desired
functionality.
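The selective forwarding of read values can be sketched for the two reads x = a[i]; y = a[j]; as follows. The Python model reduces events to the integers 0 and 1 in access order and leaves out the GATEs, the ECOMB and all timing; the function names and the RAM contents are assumptions for illustration.

```python
def select_events(current, order):
    """Inverted START of the current access merged with the other accesses' STARTs.

    Yields one event per access in program order: 1 for the current access,
    0 for every other access of the same type.
    """
    return [1 if access == current else 0 for access in order]

def selective_gate(sel_events, packets):
    """DEMUX with only the Y output connected: keep packets whose SEL event is 1."""
    return [p for sel, p in zip(sel_events, packets) if sel == 1]

ram = [10, 20, 30, 40]          # hypothetical contents of the RAM holding a
i, j = 1, 3
order = ["x", "y"]              # x = a[i]; is executed before y = a[j];
out_packets = [ram[i], ram[j]]  # the OUT output delivers one packet per read

x = selective_gate(select_events("x", order), out_packets)
y = selective_gate(select_events("y", order), out_packets)
print(x, y)
```

Each consumer sees every packet leaving the single RAM function but keeps only the one belonging to its own access, which restores the behaviour of the separate pseudo-functions of Phase 1.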

Fig. 4 shows the resulting CDFG for the first example above (two read accesses), after applying the
transformations of Phase 2 to Fig. 12. STOP1 is now generated as follows: START(old) is inverted,
"2 to 1 connected" to STOP1 (because it is the START input of the second read pseudo-function),
ECOMB-combined with RAM's ERD output and sent through the 1-FILTER/0-CONSTANT
combination. START(new) is generated similarly, but here START(old) is directly used and STOP1 inverted. The
GATEs for input IN (i and j) are connected to START(old) and STOP1, respectively, and the DEMUX
functions for outputs x and y are connected to the ECOMB outputs related to STOP1 and START(new).

Multiple write accesses use the same control events, but instead of one GATE per access for the RD
inputs, one GATE for WR and one GATE for IN (with the same E input) are used. The EWR output is
processed like the ERD output for read accesses.

This transformation ensures that all RAM accesses are executed correctly, but it is not very fast since read
or write accesses to the same RAM are not pipelined. The next access only starts after the previous one
is completed, even if the RAM being used has several pipeline stages. This inefficiency can be removed
as follows:

First continuous sequences of either read accesses or write accesses (not mixed) within a basic block are
detected by checking for pseudo-functions whose U output is directly connected to the START input of
another pseudo-function of the same RAM and the same type (read or write). For these sequences, it is
possible to stream data into the RAM rather than waiting for the previous access to complete. For this
purpose, a combination of MERGE functions selects the RD or WR and IN inputs in the order given
by the sequence. The MERGEs must be controlled by iterative ESEQs guaranteeing that the inputs are
only forwarded in the desired order. Then only the first access in the sequence needs to be controlled by
a GATE or GATEs. Similarly, the OUT outputs of a read access can be distributed more efficiently for
a sequence. A combination of DEMUX functions with the same ESEQ control can be used. It is most
efficient to arrange the MERGE and DEMUX functions as balanced binary trees.

The STARTnew signal is generated as follows: For a sequence of length n, the START signal of the
entire sequence is replicated n times by an ESEQ[00..1] function with the START input connected to
the sequence's START. Its output is directly "N to 1 connected" with the other accesses' START signal
(for single accesses) or ESEQ outputs sent through 0-CONSTANT (for access sequences),
ECOMB-connected to EWR or ERD, respectively, and sent through a 1-FILTER/0-CONSTANT combination,
similar to the basic method described above. Since only the last ESEQ output is a 1-event, only the
last RAM access generates a STARTnew, as required. Alternatively, for read accesses, the generation
of the last output can be sent through a GATE (without the E input connected), thereby producing a
STARTnew event.
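The ESEQ-based generation of STARTnew can be traced with a list-based event model in Python. The sketch assumes that the RAM emits one 0-valued ERD event per completed read and omits the "N to 1 connection" with other accesses; the four helper functions are simplified stand-ins for the corresponding RDFP functions, not their definitions.

```python
def eseq_00_1(n):
    """ESEQ[00..1]: for one START event, emit n-1 0-events followed by a 1-event."""
    return [0] * (n - 1) + [1]

def ecomb(a_events, b_events):
    """ECOMB: wait for one event on each input, then emit their OR."""
    return [a | b for a, b in zip(a_events, b_events)]

def filter_1(events):
    """1-FILTER: pass only 1-events."""
    return [e for e in events if e == 1]

def constant_0(events):
    """0-CONSTANT: replace every incoming event by a 0-event."""
    return [0 for _ in events]

n = 3                                   # sequence of three reads
erd = [0] * n                           # assumed: one 0-event per completed read
combined = ecomb(eseq_00_1(n), erd)     # one event per completed access
start_new = constant_0(filter_1(combined))
print(combined, start_new)
```

In this model the combined stream is [0, 0, 1], so exactly one 0-event reaches STARTnew, and it appears only after the last access of the sequence has completed.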

Fig. 14 shows the optimized version of the first example (Figures 12 and 4) using the ESEQ-method for
generating STARTnew, and Fig. 6 shows the final CDFG of the following larger example with three
array reads. Here the latter method for producing the STARTnew event is used.

x = a[i];
y = a[j];
z = a[k];

If several read sequences or read sequences and single read accesses occur for the same RAM, 1-events
for detecting the current accesses must be generated for sequences of read accesses. They are needed
to separate the OUT-values relating to separate sequences. The ESEQ output just defined, sent through
a 1-CONSTANT, achieves this. It is again "N to 1 connected" to the other accesses' START signals
(for single accesses) or ESEQ outputs sent through 0-CONSTANT (for access sequences). The resulting
event is used to control a first-stage DEMUX which is inserted to select the relevant OUT output data
packets of the sequence as described above for the basic method. Refer to the second example (Figures
15 and 16) in Section 4.3 for a complete example.



4.2.8 Input and Output Ports

Input and output ports are processed similarly to vector accesses. A read from an input port is like an
array read without an address. The input data packet is sent to DEMUX functions which send it to the
correct subsequent operators. The STOP signal is generated in the same way as described above for
RAM accesses by combining the INPORT's U output with the current and other START signals.

Output ports control the data packets by GATEs like array write accesses. The STOP signal is also
created as for RAM accesses.

4.3 More Examples

Fig. 7 shows the generated CDFG for the following for loop.

a = b + c;
for (i=0; i<=10; i++) {
    a = a + i;
    x[i] = k;
}

In this example, IN1 = {a} and IN2 = {k} (cf. Fig. 10). The MERGE function for variable a is
replaced by a 2 to 1 data connection as mentioned in the footnote of Section 4.2.5. Note that only one
data packet arrives for variables b, c and k, and one final packet is produced for a (out). forbody does
not use a START event since both operations (the adder and the RAM write) are dataflow-controlled
by the counter anyway. But the RAM's EWR output is the forbody's STARTnew and connected to
CNT's NEXT input. Note that the pipelining optimization, cf. Section 4.2.6, was not applied here. If it
is applied (which is possible for this loop), CNT's NEXT input is not connected, cf. Fig. 11. Here, the
loop iterations overlap. STARTnew is generated from CNT's U output and forbody's STARTnew (i. e.
RAM's EWR output), as defined at the end of Section 4.2.5.
The following program contains a vectorizable (pipelined) loop with one write access to array (RAM) x
and a sequence of two read accesses to array (RAM) y. After the loop, another single read access to y
occurs.

z = 0;
for (i=0; i<=10; i++) {
    x[i] = i;
    z = z + y[i] + y[2*i];
}
a = y[k];

Fig. 15 shows the intermediate CDFG generated before the array access Phase 2 transformation is
applied. The pipelined loop is controlled as follows: Within the loop, separate START signals for write
accesses to x and read accesses to y are used. The reentry to the forbody is also controlled by two
independent signals ("cycle1" and "cycle2"). For the read accesses, "cycle2" guarantees that the read y
accesses occur in the correct order. But the beginning of an iteration for read y and write x accesses is
not synchronized. Only at loop exit all accesses must be finished, which is guaranteed by signal "loop
finished". The single read access is completely independent of the loop.

Fig. 16 shows the final CDFG after Phase 2. Note that "cycle1" is removed since a single write access
needs no additional control, and "cycle2" is removed since the inserted MERGE and DEMUX functions
automatically guarantee the correct execution order. The read y accesses are not independent anymore
since they all refer to the same RAM, and the functions have been merged. ESEQs have been allocated
to control the MERGE and DEMUX functions of the read sequence, and for the first-stage DEMUX
functions which separate the read OUT values for the read sequence and for the final single read access.
The ECOMBs, 1-FILTERs, 0-CONSTANTs and 1-CONSTANTs are allocated as described in Section
4.2.7, Phase 2, to generate correct control events for the GATEs and DEMUX functions.

Claims

1. Method of simultaneously operating a sequential processor
and a reconfigurable array wherein data are transferred
into said reconfigurable array from a data cache
to said array and wherein results produced in said array
from said data are written to a destination.

2. Method according to claim 1, wherein said destination is
placed upstream the arithmetic unit of said sequential
processor.

3. Method according to the previous claim, wherein the data
output from said reconfigurable array is, at least in
part, fed into the data path of said processor unit
downstream the decoding circuitry of said processing
unit.

4. Method according to any of the previous claims, wherein
the arithmetic logic unit of said processor is adapted
to perform at least one operation on said data outputted
from said reconfigurable array.

5. Method according to any of the previous claims, wherein
the arithmetic-logic-circuitry comprises circuitry for
multiplication and/or division and/or in particular said
operation performed on said data outputted from said
reconfigurable array comprises a multiplication and/or
division and/or norming.






6. Method according to any of the previous claims, wherein
said data outputted from said reconfigurable array is
preferably selectably writable to a memory location
other than said cache and/or the register of said
sequential processing unit.

7. Method according to any of the previous claims, wherein
said destination is downstream of the arithmetic logic
unit and/or upstream of the cache coupled to said
processing unit.



